Set up a Python environment for the project.
Use venv or conda to manage dependencies:
```shell
python -m venv .venv
source .venv/bin/activate   # On macOS/Linux
.venv\Scripts\activate      # On Windows
```

Or with conda:

```shell
conda create --name book-recommender python=3.9
conda activate book-recommender
```

Run the following command to install dependencies:
```shell
pip install kagglehub pandas matplotlib seaborn python-dotenv \
  langchain-community langchain-openai langchain-chroma gradio \
  transformers jupyter ipywidgets
```

| Package | Description |
|---|---|
| kagglehub | Access and download datasets from Kaggle easily. |
| pandas | Data manipulation and analysis library, useful for handling structured data. |
| matplotlib | Visualization library for creating static, animated, and interactive plots. |
| seaborn | Statistical data visualization library built on top of matplotlib. |
| python-dotenv | Load environment variables from a .env file to manage secrets securely. |
| langchain-community | Community-supported extensions for working with LangChain. |
| langchain-openai | OpenAI API integration for LangChain applications. |
| langchain-chroma | ChromaDB integration for vector database storage and retrieval. |
| gradio | Create interactive web interfaces for machine learning models easily. |
| transformers | Hugging Face library for working with pre-trained transformer models. |
| jupyter | Interactive computing environment for writing and running Python code. |
| ipywidgets | Interactive widgets for Jupyter notebooks to enhance user experience. |
To start the Jupyter Notebook, run:

```shell
jupyter notebook
```

After cleaning the data, we use word embeddings and vector search to find similarities and dissimilarities between words. The process involves:
- Creating distance between words that are dissimilar.
- Relying on word embedding models that learn from how words are used in context.
- Word2Vec: learns which words immediately precede and follow a given word.
- Transforming words into embeddings and adding positional embeddings to encode each word's position.
- Feeding these embeddings into a self-attention mechanism to understand word relationships within a sentence.
- Generating self-attention vectors for each word and averaging them over multiple iterations.
- This process of generating and normalizing self-attention vectors is called the encoder block.
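The embedding and self-attention steps above can be sketched in plain NumPy. This is a toy illustration with made-up dimensions and random weights, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # 4 tokens, toy embedding size

# Word embeddings plus positional embeddings (both random stand-ins here).
word_emb = rng.normal(size=(seq_len, d_model))
pos_emb = rng.normal(size=(seq_len, d_model))
x = word_emb + pos_emb

# Scaled dot-product self-attention with random projection matrices.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_model)             # token-to-token affinities
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
attended = weights @ V                          # one self-attention vector per token

print(attended.shape)  # (4, 8)
```

Each row of `attended` is a context-aware vector for one token; stacking several such blocks (with normalization) is what the encoder does.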
- Encoder Block: Learns all relationships between words in the source language.
- Sends its output to the Decoder, which relates words in the target language and uses the encoder output to predict the most likely translation.
- Encoder-Only Models (e.g., RoBERTa): Trained to predict a masked word in text.
- Tokenizes the text and adds special `[CLS]` and `[SEP]` tokens to mark the beginning and end.
- Applies word embeddings and self-attention in encoder blocks.
- Learns internal representations of language structure to improve accuracy.
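The special-token step can be illustrated with a toy sketch (real tokenizers such as Hugging Face's handle this automatically; the token list below is invented):

```python
def add_special_tokens(tokens):
    """Wrap a token sequence with [CLS] and [SEP] markers, as BERT-style tokenizers do."""
    return ["[CLS]"] + tokens + ["[SEP]"]

tokens = ["the", "midnight", "library"]
print(add_special_tokens(tokens))
# ['[CLS]', 'the', 'midnight', 'library', '[SEP]']
```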
- Document Embedding: Identifies whether documents are similar or dissimilar based on embeddings.
- We match embeddings to generate book recommendations.
- Currently using a linear search approach.
- Exploring vector indexing databases for grouping similar vectors efficiently.
- Tradeoff exists between speed and accuracy in search optimization.
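The linear search mentioned above can be sketched as a brute-force cosine-similarity scan. The embeddings here are random stand-ins for real model output:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(1)
book_embeddings = rng.normal(size=(1000, 384))  # one vector per book (made-up data)
query = rng.normal(size=384)

# Linear search: score every book against the query, O(n) in the number of books.
scores = [cosine_sim(query, emb) for emb in book_embeddings]
top5 = np.argsort(scores)[::-1][:5]  # indices of the 5 most similar books
print(top5)
```

A vector index (such as the one ChromaDB provides) replaces this exhaustive scan with an approximate sub-linear lookup, which is where the speed/accuracy tradeoff comes from.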
- LangChain is a Python framework offering various LLM functionalities.
- Used for creating Retrieval-Augmented Generation (RAG) pipelines and chatbots.
- Provides state-of-the-art AI capabilities without locking the project into a single LLM provider.
- Text classification is a branch of NLP that assigns text to categories.
- Zero-shot classification can categorize books into different groups without labeled training data.
- Using Hugging Face’s `transformers` library, we apply zero-shot learning to classify books by genre, topic, or audience.
- This step helps refine recommendations by filtering books based on user preferences.
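With the `transformers` pipeline, zero-shot classification might look like this (the model choice is illustrative, and the first run downloads the model weights):

```python
from transformers import pipeline

# Zero-shot classifier: no labeled book data needed, just candidate label names.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "A detective unravels a string of murders in 1920s London.",
    candidate_labels=["fiction", "non-fiction"],
)
print(result["labels"][0])  # label with the highest score
```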
- To provide users with an additional degree of control, we fine-tune our LLM to classify emotion.
- We start from the pre-trained RoBERTa model and its encoder layers.
- Instead of predicting masked words, we replace the last layer with an emotion classification layer.
- This helps categorize books based on emotional tone, improving recommendations.
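Conceptually, the swapped-in classification layer is just a linear projection plus softmax over the encoder's pooled output. A NumPy sketch with made-up weights (768 is RoBERTa-base's hidden size; the 7 emotion classes are an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, num_emotions = 768, 7  # RoBERTa-base hidden size; 7 emotion classes (assumed)

pooled = rng.normal(size=d_model)             # stand-in for the encoder's [CLS] representation
W = rng.normal(size=(d_model, num_emotions))  # classification layer, learned during fine-tuning
b = np.zeros(num_emotions)

logits = pooled @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax: one probability per emotion
print(probs.argmax())                         # index of the predicted emotion
```

During fine-tuning only this small layer (and optionally the encoder) is updated on emotion-labeled text.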

- We build a vector database of book summaries that lets us retrieve the texts most similar to a query.
- Text classification is used to determine if a book is fiction or non-fiction.
- After classification, we analyze the emotional tone of the book.
- We create an interactive Gradio dashboard, an open-source Python package, to visualize and explore recommendations dynamically.
- We wrap the Gradio app in FastAPI (required for deployment on Vercel).
Feel free to fork this repository and submit pull requests. Contributions are welcome!