Development of a chatbot using the Retrieval-Augmented Generation (RAG) technique, built during an internship at DataMantix s.r.l., Udine. The system was designed to assist researchers studying Italian vintage cinema by allowing them to query a large corpus of OCR-digitized journals through a conversational interface.
The system combines a vector database with a Large Language Model to retrieve relevant document chunks and generate grounded answers. Users can interact via a web portal, select the embedding model and prompt style, and inspect the source documents behind each answer.
- PDF documents are loaded, split into chunks, and embedded into a ChromaDB vector store.
- A user question is embedded using the same model and a similarity search retrieves the top-k relevant chunks.
- The chunks, the question, and the conversation history are passed to an LLM (via Ollama) to generate a response.
- The response, the raw document chunks, and their source identifiers are returned to the frontend.
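The sketch below traces this flow end to end with LangChain. It is a minimal illustration rather than the project code: the retrieval depth `k=5` and the prompt wording are assumptions, while the chunking parameters and the database path follow the notes at the end of this README.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFDirectoryLoader  # requires pypdf
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

# 1. Load PDFs and split them into overlapping chunks.
documents = PyPDFDirectoryLoader("data/").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50).split_documents(documents)

# 2. Embed the chunks and persist them in a ChromaDB store.
embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-large-en-v1.5")
db = Chroma.from_documents(chunks, embeddings, persist_directory="chroma_BAAI_Produzione")

# 3. Embed the question with the same model and retrieve the top-k chunks.
question = "Chi era Eitel Monaco?"
results = db.similarity_search_with_score(question, k=5)
context = "\n\n---\n\n".join(doc.page_content for doc, _score in results)

# 4. Ask the LLM for an answer grounded in the retrieved chunks.
prompt = f"Answer using only the following context:\n{context}\n\nQuestion: {question}"
print(Ollama(model="llama3").invoke(prompt))
```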
Project structure:

```
.
├── populate_database.py        # Loads PDFs, splits into chunks, embeds and stores in ChromaDB
├── get_embedding_function.py   # Returns the embedding model (BAAI or OpenAI)
├── query_data.py               # Core RAG query logic, called by the API
├── query_data_parser.py        # Standalone CLI version of the RAG query
├── query_data_test_prompt.py   # Utility to inspect generated prompts without calling the LLM
├── api_chatbot.py              # FastAPI backend exposing the /response endpoint
├── fastapi_prova.py            # FastAPI playground used during development
├── data_retrival.py            # Retrieves specific documents from ChromaDB by ID
└── test_rag.py                 # Automated evaluation tests for RAG responses
```
Requirements:

- Python 3.10+
- Ollama with `llama3` and/or `llama3.1` pulled locally
- ChromaDB
- LangChain
- FastAPI and Uvicorn
- Pydantic
- HuggingFace `sentence-transformers` (for BAAI embeddings)
- OpenAI Python SDK (optional, for OpenAI embeddings)
Install dependencies:
```
pip install langchain langchain-community chromadb fastapi uvicorn pydantic sentence-transformers openai
```

1. Add your documents
Place PDF files in the `data/` directory, organized into topic subfolders if needed.
2. Populate the database
```
python populate_database.py
```

To reset and rebuild the database:
```
python populate_database.py --reset
```

The database will be saved to a folder named `chroma_<engine>_<argument>` (e.g. `chroma_BAAI_Produzione`).
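For illustration, a reset flag like this is typically wired up with argparse. The sketch below is an assumption about the structure of populate_database.py, with a hypothetical hard-coded path; the real script derives the path from the engine and argument:

```python
import argparse
import shutil

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--reset", action="store_true",
                        help="Clear the existing database before rebuilding.")
    args = parser.parse_args()

    if args.reset:
        # Delete the persisted Chroma directory so all chunks are re-embedded.
        shutil.rmtree("chroma_BAAI_Produzione", ignore_errors=True)

    # ...load PDFs, split into chunks, embed, and store (see the pipeline sketch above)...

if __name__ == "__main__":
    main()
```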
3. Run the API
```
uvicorn api_chatbot:app --reload
```

POST /response

```
{
"answerType": "Precisa",
"argument": "Produzione",
"context": [
{ "content": "previous user message", "sentBy": "user" },
{ "content": "previous bot response", "sentBy": "bot" }
],
"engine": "BAAI",
"question": "Chi era Eitel Monaco?"
}
```

Response:

```
{
"answer": "...",
"text": ["chunk 1 content", "chunk 2 content", "..."],
"sources": ["data/file.pdf:3:12", "..."]
}
```
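The request body above maps naturally onto Pydantic models. This is a hedged sketch of what the models in api_chatbot.py could look like; the class names and the handler body are assumptions, not the actual source:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Turn(BaseModel):
    content: str
    sentBy: str  # "user" or "bot"

class QueryRequest(BaseModel):
    answerType: str    # "Precisa" or "Estesa/Generica"
    argument: str      # ChromaDB collection, e.g. "Produzione"
    context: list[Turn]
    engine: str        # "BAAI" or "OpenAi"
    question: str

@app.post("/response")
def respond(request: QueryRequest):
    # The real backend delegates to the RAG logic in query_data.py;
    # answer/text/sources mirror the response shape shown above.
    answer, chunks, sources = "", [], []  # placeholder
    return {"answer": answer, "text": chunks, "sources": sources}
```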
Parameters

| Field | Values | Description |
|---|---|---|
| `answerType` | `Precisa`, `Estesa/Generica` | Controls the prompt structure |
| `engine` | `BAAI`, `OpenAi` | Embedding model to use |
| `argument` | e.g. `Produzione` | Selects which ChromaDB collection to query |
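Once the server is running, a minimal client call looks like this. It assumes Uvicorn's default address (127.0.0.1:8000) and uses the requests library, which is not in the dependency list above:

```python
import requests

payload = {
    "answerType": "Precisa",
    "argument": "Produzione",
    "context": [],  # empty conversation history for the first turn
    "engine": "BAAI",
    "question": "Chi era Eitel Monaco?",
}
resp = requests.post("http://127.0.0.1:8000/response", json=payload)
data = resp.json()
print(data["answer"])
print(data["sources"])  # e.g. ["data/file.pdf:3:12", ...]
```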
Embedding models evaluated:

- Ollama Nomic-Embed-Text — baseline, used in early experiments
- TensorFlow Hub — second reference model
- spaCy it_core_news_lg — Italian-language model, underperformed
- BAAI/bge-large-en-v1.5 — best open-source result, used in production
- JinaAI jina-embeddings-v2-small-en — paid, good results on small DB
- OpenAI text-embedding-ada-002 — best overall quality, limited by cost
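Based on the two production options above, get_embedding_function.py plausibly looks like the following. This is a sketch under the assumption that it switches on the same engine values as the API; the actual file may differ:

```python
from langchain_community.embeddings import HuggingFaceBgeEmbeddings, OpenAIEmbeddings

def get_embedding_function(engine: str):
    """Return the embedding model matching the API's `engine` field."""
    if engine == "BAAI":
        # Local sentence-transformers model, best open-source result.
        return HuggingFaceBgeEmbeddings(model_name="BAAI/bge-large-en-v1.5")
    if engine == "OpenAi":
        # Paid API model; requires OPENAI_API_KEY in the environment.
        return OpenAIEmbeddings(model="text-embedding-ada-002")
    raise ValueError(f"Unsupported engine: {engine!r}")
```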
LLMs used:

- Mistral 7B — early experiments
- LLaMA 3.0 — primary model throughout the project
- LLaMA 3.1 — used in the final phase; improved Italian language support and a 128k context window
Notes:

- The vector database path is constructed as `chroma_<engine>_<argument>`, so each combination of embedding model and topic has its own store.
- A chunk size of 300 characters with an overlap of 50 produced the best results for specific questions.
- Conversation history is injected into the prompt to maintain context across turns (see the sketch after this list).
- Source documents were OCR-scanned journals; OCR quality affects retrieval accuracy.
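To illustrate the history injection, here is a hedged sketch of a prompt builder. The real templates live in query_data.py and also vary with answerType, so the wording below is illustrative only:

```python
def build_prompt(question: str, context: str, history: list[dict]) -> str:
    """Assemble an LLM prompt from retrieved chunks and prior turns."""
    # Flatten prior turns ({"content": ..., "sentBy": ...}) into plain text.
    turns = "\n".join(f"{m['sentBy']}: {m['content']}" for m in history)
    return (
        "Answer the question using only the following context:\n"
        f"{context}\n\n"
        f"Conversation so far:\n{turns}\n\n"
        f"Question: {question}"
    )
```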
University of Udine — DMIF, Academic Year 2024-2025