Data Visualization and Text Mining

Università Cattolica del Sacro Cuore, Academic Year 2025-2026

About

This repository contains notebooks and materials for the Data Visualization and Text Mining course. Notebooks will be published the day before each lesson.

Course Contents

01 - NLP In Practice

Introduction to Natural Language Processing using spaCy and NLTK.

NLP Basics and pipeline overview
Tokenization, Stemming, Lemmatization
Stop Words
Part of Speech (POS) Tagging and visualization
Named Entity Recognition (NER) and visualization with displacy
Exploratory Data Analysis on Text Data

02 - Text Classification

Machine Learning applied to text data.

Scikit-Learn basics
N-grams
Bag-of-Words, CountVectorizer, TfidfVectorizer
Text Classification project (end-to-end)
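
As a rough illustration of the N-gram and Bag-of-Words ideas listed above, here is a minimal standard-library sketch. It is not the scikit-learn `CountVectorizer` used in the notebooks; the helper names `ngrams` and `bag_of_words` and the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text, n=1):
    """Count lowercase, whitespace-separated n-grams in a text."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


doc = "the cat sat on the mat"
unigrams = bag_of_words(doc, n=1)
bigrams = bag_of_words(doc, n=2)

print(unigrams[("the",)])       # how often "the" occurs
print(bigrams[("the", "cat")])  # how often "the cat" occurs
```

A Bag-of-Words vector is simply these counts laid out over a fixed vocabulary; word order is discarded except for what the n-grams capture.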

03 - Neural Networks

Introduction to Deep Learning for NLP.

Simple Classifier with Keras
Simple Classifier with PyTorch

04 - Embeddings

Word vectors and text representation.

Feature Engineering for Text Data
Word2Vec (CBOW and Skip-gram)
GloVe embeddings
Sentence-Transformers (modern embeddings)

05 - LSTM

Recurrent Neural Networks for NLP.

Text Generation with Neural Networks
LSTM for Sentiment Classification
LSTM for Named Entity Recognition (BiLSTM on CoNLL-2003)

06 - Topic Modeling

Unsupervised methods for discovering topics in text.

Latent Dirichlet Allocation (LDA)
Non-Negative Matrix Factorization (NMF)
Topic Model Evaluation (coherence, perplexity, diversity)
BERTopic (modern embedding-based approach)

07 - Data Visualization

Comprehensive guide to data visualization in Python.

Matplotlib basics and line plots
Area plots, histograms, and bar charts
Pie charts, box plots, scatter plots, and bubble plots
Waffle charts, word clouds, and regression plots
Generating maps in Python
Plotly basics for interactive visualizations

08 - Dashboards

Interactive dashboard development with Dash.

Layout creation with HTML and Dash Bootstrap Components
Navigation bars and cards
HTML and core components
Tables and interactive elements
Callbacks (basic, multiple inputs/outputs, chained, with State)
Real-world applications: COVID dashboard, sales app, NLP Q&A app

09 - Transfer Learning

Practical demonstrations of transfer learning in NLP.

Transfer Learning Demo: Comparing three approaches on BERT
- Training from scratch (random initialization)
- Feature extraction (frozen BERT + classifier)
- Fine-tuning (end-to-end adaptation)
- Performance comparison and visualization
ULMFiT Experiment: Three-step transfer learning process
- General-domain language model pre-training (simulated)
- Target task language model fine-tuning
- Classifier training with gradual unfreezing
- Discriminative fine-tuning with layer-wise learning rates

10 - Attention Mechanisms

Understanding attention in neural networks.

Bahdanau Attention (simplified implementation)
Visualization of attention weights
Seq2seq models with attention

11 - Transformers

Introduction to Transformer architecture and Hugging Face ecosystem.

Transformer anatomy and self-attention mechanism
Multi-head attention visualization
Hugging Face Transformers library introduction
Working with pre-trained models
Tokenizers and model pipelines
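
To make the self-attention item above more concrete, here is a minimal pure-Python sketch of scaled dot-product attention. It is a simplification under stated assumptions: queries, keys, and values are all set to the raw input vectors, whereas a real Transformer layer first applies learned linear projections. The function names are chosen for this example only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights.append(softmax(scores))
    # Each output is a weighted average of the value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


# Three toy token vectors; assumption: Q = K = V = x (no learned projections).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how much one token attends to every other token, which is the quantity the attention visualizations plot.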

12 - BERT (Encoder Models)

Advanced applications with BERT-based models.

Text Classification with Transformers
- Fine-tuning BERT for sentiment analysis
- Model evaluation and comparison
- Comprehensive metrics (Accuracy, Precision, Recall, F1, ROC)
Question Answering with fine-tuned BERT
- SQuAD dataset and QA task
- Extractive QA implementation
- Exact Match (EM) and F1 metrics
- Confidence scores and answer extraction
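
The Exact Match and F1 metrics listed above can be sketched as follows. This is an illustrative reimplementation in the spirit of SQuAD-style answer scoring (lowercasing, stripping punctuation and English articles), not the official evaluation script; the function names are assumptions for this example.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def token_f1(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(token_f1("in Paris France", "Paris"))
```

Exact Match is strict, while token-level F1 gives partial credit when the predicted span overlaps the reference answer.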

13 - GPT (Decoder Models)

Text generation with GPT models.

Autoregressive text generation
Sampling strategies (temperature, top-k, top-p)
Generation metrics
- Perplexity (model confidence)
- Diversity metrics (Distinct-n, Self-BLEU)
- BLEU and ROUGE scores
BERT vs GPT comparison
Controlled generation techniques
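
The sampling strategies listed above (temperature, top-k, top-p) can be sketched with the standard library alone. This is a toy illustration over a hand-written list of logits, not the Hugging Face `generate` implementation; the function names and the example logits are assumptions for this sketch.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def filter_top_p(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]


def sample(probs, rng):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical next-token scores
rng = random.Random(0)
probs = softmax(logits, temperature=0.7)
print(sample(filter_top_k(probs, k=2), rng))
print(sample(filter_top_p(probs, p=0.9), rng))
```

Temperature reshapes the whole distribution, while top-k and top-p truncate its tail before sampling; in practice they are often combined.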

14 - RAG & Modern LLMs

Real-world applications combining retrieval and generation.

RAG Pipeline (NLP14_1_RAG_Pipeline.ipynb)
- Building knowledge bases with vector databases (FAISS)
- Semantic search with embeddings
- Retrieval-Augmented Generation architecture
- Retrieval metrics (Precision@K, Recall@K, MRR, NDCG)
- Generation quality metrics (faithfulness, relevance)
- End-to-end performance analysis
Modern LLM APIs (NLP14_2_Modern_LLMs.ipynb)
- Working with GPT-4, Claude, and other modern LLMs
- Prompt engineering techniques
- Few-shot learning
- Cost-performance optimization
- LLM evaluation frameworks
- Production best practices
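
Three of the retrieval metrics listed for the RAG pipeline (Precision@K, Recall@K, MRR) can be sketched as below. The document IDs are made up for illustration, the function names are assumptions for this example, and NDCG is left out to keep the sketch short.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical ranking for one query and its set of relevant documents.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}

print(precision_at_k(retrieved, relevant, k=3))
print(recall_at_k(retrieved, relevant, k=3))
print(mean_reciprocal_rank([(retrieved, relevant), (["d9", "d4"], relevant)]))
```

Precision@K and Recall@K ignore the order within the top K, whereas MRR rewards placing the first relevant document as high as possible.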

15 - AI Agents

Autonomous systems that use LLMs with tools to complete complex tasks.

ReAct Agents - Reasoning + Acting in iterative loops
Agent Components
- LLM brain for decision-making
- Tool integration (APIs, calculators, databases)
- Memory systems (short-term and long-term)
- Multi-step planning and execution
Practical Implementation
- Building agents with OpenAI function calling
- Building agents with Gemini function calling (FREE)
- Calculator, Wikipedia search, weather, and datetime tools
- Agent with conversation memory
- Error handling and best practices
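
The reason-act-observe loop behind a ReAct agent can be sketched without any API key. In this example a scripted function stands in for the LLM, so it only shows the control flow; it does not use OpenAI or Gemini function calling. The names `scripted_llm`, `run_agent`, and `TOOLS` are hypothetical and exist only in this sketch.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def calculator(expression):
    """Evaluate a simple 'a op b' expression without using eval."""
    left, op, right = expression.split()
    return str(OPS[op](float(left), float(right)))


TOOLS = {"calculator": calculator}


def scripted_llm(history):
    """Stand-in for a real LLM call: decide the next step from the history."""
    observations = [step for step in history if step[0] == "observation"]
    if not observations:
        return ("action", "calculator", "6 * 7")
    return ("final", f"The answer is {observations[-1][1]}")


def run_agent(question, llm, tools, max_steps=5):
    """Loop: ask the model, run the chosen tool, feed the observation back."""
    history = [("question", question)]
    for _ in range(max_steps):
        decision = llm(history)
        if decision[0] == "final":
            return decision[1], history
        _, tool_name, tool_input = decision
        try:
            observation = tools[tool_name](tool_input)
        except Exception as error:
            # Report tool failures back to the model instead of crashing.
            observation = f"error: {error}"
        history.append(("action", tool_name, tool_input))
        history.append(("observation", observation))
    return "Stopped: step limit reached", history


answer, history = run_agent("What is 6 times 7?", scripted_llm, TOOLS)
print(answer)
```

The `max_steps` cap and the `try`/`except` around the tool call are the two simplest safeguards: the first prevents endless loops, the second lets the model see and react to a failed tool call.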

🆓 FREE API Access for Students

Many modern LLM notebooks require paid API keys (OpenAI, Anthropic), which can be a barrier for students. We've created free alternatives using Google Gemini that require no credit card!

Gemini-Based Notebooks

Complete notebooks using the free Google Gemini API:

Modern LLMs with Gemini (14-RAG/NLP14_2_Modern_LLMs_Gemini.ipynb)
- Prompt engineering, few-shot learning, evaluation
- No credit card required
AI Agents with Gemini (15-AI_Agent/NLP15_1_AI_Agent_Gemini.ipynb)
- Autonomous agents with tool use
- Function calling, memory, multi-step reasoning

Getting Your Free Gemini API Key

Go to Google AI Studio
Sign in with your Google account
Click "Get API Key" → "Create API key"
Copy your key and use it in the notebooks

Free Tier Limits:

60 requests per minute
1,500 requests per day
1 million tokens per month
More than enough for all coursework!
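
If a notebook loops over many prompts, a small client-side throttle helps stay under a per-minute limit. The sketch below assumes the 60 requests per minute figure quoted above; `RateLimiter` is a hypothetical helper written for this example and is not part of any Gemini SDK.

```python
import time
from collections import deque


class RateLimiter:
    """Sliding-window limiter: at most max_calls per `period` seconds."""

    def __init__(self, max_calls=60, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of recent calls, oldest first

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have already left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)


limiter = RateLimiter(max_calls=60, period=60.0)
for prompt in ["first prompt", "second prompt", "third prompt"]:
    limiter.wait()  # call this right before each API request
    print("sending:", prompt)
```

The `clock` and `sleep` parameters are injectable only so the limiter can be tested without real waiting; in a notebook the defaults are enough.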

Getting Started

Clone the Repository

```
git clone <repository-url>
cd text-mining-dataviz-aa2526
```

Stay Updated

Since new content is added before each lesson, make sure to pull the latest changes regularly:

```
git pull origin main
```

Tip: Run this command before starting a lesson to get the new notebooks.

Open Notebooks in Google Colab

You can open any notebook directly in Google Colab without cloning the repository:

Navigate to the notebook file on GitHub
Replace github.com in the URL with colab.research.google.com/github

Example:

```
# Original GitHub URL
https://github.com/nluninja/text-mining-dataviz-aa2526/blob/main/notebook.ipynb

# Colab URL
https://colab.research.google.com/github/nluninja/text-mining-dataviz-aa2526/blob/main/notebook.ipynb
```
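
The URL rewrite described above is mechanical, so it can be scripted. A minimal sketch, assuming the notebook lives on github.com; the helper name `to_colab_url` is made up for this example.

```python
GITHUB_PREFIX = "https://github.com/"
COLAB_PREFIX = "https://colab.research.google.com/github/"


def to_colab_url(github_url: str) -> str:
    """Rewrite a GitHub notebook URL so that it opens in Google Colab."""
    if not github_url.startswith(GITHUB_PREFIX):
        raise ValueError(f"not a github.com URL: {github_url}")
    # Swap the host prefix and keep the owner/repo/blob/... path unchanged.
    return COLAB_PREFIX + github_url[len(GITHUB_PREFIX):]


example = "https://github.com/nluninja/text-mining-dataviz-aa2526/blob/main/notebook.ipynb"
print(to_colab_url(example))
```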

Alternatively, from Google Colab:

Go to colab.research.google.com
Click File → Open notebook
Select the GitHub tab
Enter the repository URL and select the notebook you want to open

Note: Notebooks opened from GitHub in Colab are not saved back to the repository. To keep your changes, use File → Save a copy in Drive.

Workflow for Each Lesson

Before the lesson: Pull the latest updates
```
git pull origin main
```
During the lesson: Work through the notebooks
After the lesson: Your local changes won't conflict with future updates as long as you don't modify the original files. If you want to experiment, consider creating a copy of the notebook.

Handling Conflicts

If you've made changes to a notebook and encounter conflicts when pulling:

```
# Option 1: Stash your changes, pull, then reapply
git stash
git pull origin main
git stash pop

# Option 2: Discard local changes and get the latest version
git checkout -- .
git pull origin main
```

Project Track

The project-track/ folder contains information and resources for the final course project:

README.md - Complete project requirements and guidelines
datasets/ - Collection of 10 curated datasets for text classification and entity extraction tasks
- 5 Text Classification datasets (Sentiment Analysis, News Categorization, Spam Detection, etc.)
- 5 Entity Extraction datasets (NER, Product Attributes, Medical Entities, etc.)
- LLM_USE_CASES.md - Optional guidance for using LLMs in your project

Project Requirements Summary

Students must build a text processing pipeline that includes:

Exploratory Data Analysis with visualizations
Neural Network approach (LSTM, BiLSTM, RNN, etc.)
Transformer-based approach (BERT, etc.)
Model comparison and metrics
Interactive dashboard combining all aspects

See the project-track README for full details and submission guidelines.

Tools

The tools/ folder contains helpful resources and tutorials:

Git-Quick-Tutorial.md - A quick reference guide for Git commands and workflows
Git-Exercises.md - Practice exercises to reinforce Git skills

License

This project is licensed under the MIT License - see the LICENSE file for details.

This repository is intended for academic and educational purposes.

Contact

For questions about the course materials, please contact the instructor:

Andrea Belli - andrea.belli@unicatt.it
