RAG Chatbot — Bachelor Thesis

Development of a chatbot using the Retrieval-Augmented Generation (RAG) technique, built during an internship at DataMantix s.r.l., Udine. The system was designed to assist researchers studying Italian vintage cinema by allowing them to query a large corpus of digitized OCR journals through a conversational interface.

Overview

The system combines a vector database with a Large Language Model to retrieve relevant document chunks and generate grounded answers. Users can interact via a web portal, select the embedding model and prompt style, and inspect the source documents behind each answer.

Architecture

  1. PDF documents are loaded, split into chunks, and embedded into a ChromaDB vector store.
  2. A user question is embedded using the same model and a similarity search retrieves the top-k relevant chunks.
  3. The chunks, the question, and the conversation history are passed to an LLM (via Ollama) to generate a response.
  4. The response, the raw document chunks, and their source identifiers are returned to the frontend.
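
Step 2 above can be illustrated with a simplified, dependency-free sketch of top-k similarity search (the real system uses ChromaDB and the selected embedding model; the function names and toy vectors here are illustrative only):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = sorted(zip(chunks, chunk_vecs),
                    key=lambda pair: cosine_similarity(query_vec, pair[1]),
                    reverse=True)
    return [chunk for chunk, _ in scored[:k]]

# Toy embeddings: in the real system these come from the selected embedding model.
chunks = ["chunk about Eitel Monaco", "chunk about film production", "unrelated chunk"]
vectors = [[1.0, 0.1], [0.8, 0.6], [0.0, 1.0]]
query = [0.9, 0.2]
print(top_k_chunks(query, vectors, chunks, k=2))
# → ['chunk about Eitel Monaco', 'chunk about film production']
```

In the actual pipeline the retrieved chunks are then interpolated into the prompt together with the question and conversation history.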

Project Structure

```
.
├── populate_database.py        # Loads PDFs, splits into chunks, embeds and stores in ChromaDB
├── get_embedding_function.py   # Returns the embedding model (BAAI or OpenAI)
├── query_data.py               # Core RAG query logic, called by the API
├── query_data_parser.py        # Standalone CLI version of the RAG query
├── query_data_test_prompt.py   # Utility to inspect generated prompts without calling the LLM
├── api_chatbot.py              # FastAPI backend exposing the /response endpoint
├── fastapi_prova.py            # FastAPI playground used during development
├── data_retrival.py            # Retrieves specific documents from ChromaDB by ID
└── test_rag.py                 # Automated evaluation tests for RAG responses
```

Requirements

  • Python 3.10+
  • Ollama with llama3 and/or llama3.1 pulled locally
  • ChromaDB
  • LangChain
  • FastAPI and Uvicorn
  • Pydantic
  • HuggingFace sentence-transformers (for BAAI embeddings)
  • OpenAI Python SDK (optional, for OpenAI embeddings)

Install dependencies:

```bash
pip install langchain langchain-community chromadb fastapi uvicorn pydantic sentence-transformers openai
```

Setup

1. Add your documents

Place PDF files in the data/ directory, organized by topic subfolder if needed.

2. Populate the database

```bash
python populate_database.py
```

To reset and rebuild the database:

```bash
python populate_database.py --reset
```

The database will be saved to a folder named chroma_<engine>_<argument> (e.g. chroma_BAAI_Produzione).
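
The naming convention can be expressed as a small helper (a sketch; the actual script may assemble the path differently):

```python
def chroma_path(engine: str, argument: str) -> str:
    """Build the per-(engine, topic) ChromaDB directory name."""
    return f"chroma_{engine}_{argument}"

print(chroma_path("BAAI", "Produzione"))  # → chroma_BAAI_Produzione
```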

3. Run the API

```bash
uvicorn api_chatbot:app --reload
```

API Usage

POST /response

```json
{
  "answerType": "Precisa",
  "argument": "Produzione",
  "context": [
    { "content": "previous user message", "sentBy": "user" },
    { "content": "previous bot response", "sentBy": "bot" }
  ],
  "engine": "BAAI",
  "question": "Chi era Eitel Monaco?"
}
```

Response:

```json
{
  "answer": "...",
  "text": ["chunk 1 content", "chunk 2 content", "..."],
  "sources": ["data/file.pdf:3:12", "..."]
}
```
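
Each entry in `sources` appears to follow a `file:page:chunk` convention (so `data/file.pdf:3:12` would be chunk 12 on page 3 of `data/file.pdf`). Assuming that format, a small parser could look like:

```python
def parse_source_id(source_id: str) -> dict:
    """Split a 'path:page:chunk' identifier into its parts.

    rsplit is used so that colons inside the file path are preserved.
    """
    path, page, chunk = source_id.rsplit(":", 2)
    return {"path": path, "page": int(page), "chunk": int(chunk)}

print(parse_source_id("data/file.pdf:3:12"))
# → {'path': 'data/file.pdf', 'page': 3, 'chunk': 12}
```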

Parameters

| Field | Values | Description |
| --- | --- | --- |
| `answerType` | `Precisa`, `Estesa`/`Generica` | Controls the prompt structure |
| `engine` | `BAAI`, `OpenAi` | Embedding model to use |
| `argument` | e.g. `Produzione` | Selects which ChromaDB collection to query |
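
A minimal client for the endpoint can be sketched with the standard library only (the URL, port, and default field values below are assumptions based on the examples above, not part of the project):

```python
import json
from urllib import request

def build_payload(question, engine="BAAI", argument="Produzione",
                  answer_type="Precisa", history=None):
    """Assemble the JSON body expected by POST /response."""
    return {
        "answerType": answer_type,
        "argument": argument,
        "context": history or [],
        "engine": engine,
        "question": question,
    }

def ask(question, url="http://127.0.0.1:8000/response"):
    """POST a question to the chatbot API and return the parsed reply."""
    body = json.dumps(build_payload(question)).encode("utf-8")
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Multi-turn conversations would pass the previous `{"content": ..., "sentBy": ...}` messages as `history`.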

Embedding Models Tested

  • Ollama Nomic-Embed-Text — baseline, used in early experiments
  • TensorFlow Hub — second reference model
  • spaCy it_core_news_lg — Italian-language model, underperformed
  • BAAI/bge-large-en-v1.5 — best open-source result, used in production
  • JinaAI jina-embeddings-v2-small-en — paid, good results on a small database
  • OpenAI text-embedding-ada-002 — best overall quality, limited by cost

LLM Models Used

  • Mistral 7B — early experiments
  • LLaMA 3.0 — primary model throughout the project
  • LLaMA 3.1 — used in final phase; improved Italian language support and 128k context window

Notes

  • The vector database path is constructed as chroma_<engine>_<argument>, so each combination of embedding model and topic has its own store.
  • Chunk size of 300 characters with 50 overlap produced the best results for specific questions.
  • Conversation history is injected into the prompt to maintain context across turns.
  • Source documents were OCR-scanned journals; OCR quality affects retrieval accuracy.
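
The 300/50 chunking scheme from the notes can be sketched as a plain character splitter (the project presumably uses a LangChain text splitter; this stdlib version only illustrates the size/overlap arithmetic):

```python
def split_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into chunk_size-character pieces, each sharing
    `overlap` characters with the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance by 250 chars per chunk
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("x" * 700)
print([len(c) for c in chunks])  # → [300, 300, 200]
```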

University of Udine — DMIF, Academic Year 2024-2025
