114 changes: 85 additions & 29 deletions README.md
@@ -4,20 +4,29 @@
[![Python](https://img.shields.io/badge/Python-3.8%2B-blue)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

# Sentiment Analysis

This repository contains a sentiment analysis application that uses TensorFlow and Keras to classify text data into positive or negative sentiments. The application includes a speech-to-text interface (`voice_to_text_app.py`) built with Dash, which allows users to record audio, transcribe it into text, and analyze its sentiment.
# Sentiment Analysis and Translation

This repository contains a sentiment analysis application and an English-to-French translation model. The sentiment analysis application uses TensorFlow and Keras to classify text as expressing positive or negative sentiment. The translation model implements a Transformer-based encoder-decoder architecture for sequence-to-sequence learning.

## Features

### Sentiment Analysis
- **Speech-to-Text**: Converts spoken audio into text using the Vosk library.
- **Text Preprocessing**: Uses TensorFlow's `TextVectorization` layer to tokenize and vectorize text data.
- **Bidirectional LSTM Model**: Implements a deep learning model with embedding, bidirectional LSTM, and dense layers for sentiment classification (a minimal sketch follows this list).
- **Training and Evaluation**: Includes functionality to train the model on a dataset and evaluate its performance on validation and test sets.
- **Inference**: Provides an inference pipeline to predict sentiment for new text inputs.
- **Interactive Application**: A Dash-based web application for real-time speech-to-text and sentiment analysis.
- **Translation Dataset Support**: Processes English-French translation datasets for sequence-to-sequence learning.
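
A minimal sketch of the kind of model this pipeline builds, with illustrative sizes (`max_tokens`, `embedding_dim`, and `lstm_units` are assumptions, not the project's tuned values; those live in `src/configurations/` and `src/models/optuna_model_binary.json`):

```python
import tensorflow as tf

# Illustrative hyperparameters; the tuned values come from the Optuna study.
max_tokens, sequence_length = 20_000, 100
embedding_dim, lstm_units = 128, 64

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=max_tokens, output_sequence_length=sequence_length
)
# vectorizer.adapt(train_texts)  # fit the vocabulary on raw training text first

model = tf.keras.Sequential([
    vectorizer,
    tf.keras.layers.Embedding(max_tokens, embedding_dim, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_units)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of "positive"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```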

### English-to-French Translation
- **Transformer Model**: Implements a sequence-to-sequence Transformer model for English-to-French translation (see the composition sketch after this list).
- **BLEU Score Evaluation**: Evaluates the quality of translations using the BLEU metric.
- **Preprocessing**: Includes utilities for tokenizing and vectorizing English and French text.
- **Model Saving and Loading**: Supports saving and loading trained Transformer models for reuse.
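
These components compose into the full sequence-to-sequence model along the following lines (a sketch: the vocabulary size, sequence length, dimensions, and compile settings are illustrative, and the actual wiring lives in `src/translation_french_english.py`):

```python
import tensorflow as tf
from modules.transformer_components import (
    PositionalEmbedding,
    TransformerDecoder,
    TransformerEncoder,
)

# Illustrative sizes only.
vocab_size, sequence_length = 15_000, 20
embed_dim, dense_dim, num_heads = 256, 2048, 8

encoder_inputs = tf.keras.Input(shape=(None,), dtype="int64", name="english")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)

decoder_inputs = tf.keras.Input(shape=(None,), dtype="int64", name="french")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)
decoder_outputs = tf.keras.layers.Dense(vocab_size, activation="softmax")(x)

transformer = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
transformer.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```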

---

## Installation

### Install Dependencies

@@ -33,6 +33,8 @@ Then, install the project dependencies:
poetry install
```

---

## Project Structure

```
@@ -46,22 +46,30 @@ Sentiment_Analysis/
│ ├── models/ # Saved models
│ │ ├── inference_model.keras
│ │ ├── sentiment_keras_binary.keras
│ │ ├── transformer_best_model.keras
│ │ └── optuna_model_binary.json # Hyperparameter optimization results
│ ├── configurations/ # Configuration files
│ │ ├── model_builder_config.json
│ │ ├── model_trainer_config.json
│ │ └── optuna_config.json
│ ├── modules/ # Custom Python modules
│ │ ├── __init__.py # Makes the folder a Python package
│ │ ├── load_data.py # Data loading utilities
│ │ ├── model.py # Model definition and training
│ │ ├── data_preprocess.py # Data preprocessing utilities
│ │ ├── text_vectorizer.py # Text vectorization utilities
│ │ ├── utils.py # Enum classes
│ ├── modules/ # Custom Python modules
│ │ ├── __init__.py # Makes the folder a Python package
│ │ ├── load_data.py # Data loading utilities
│ │ ├── model.py # Model definition and training
│ │ ├── data_preprocess.py # Data preprocessing utilities
│ │ ├── text_vectorizer.py # Text vectorization utilities
│ │ ├── utils.py # Enum classes
│ │ ├── sentiment_analysis_utils.py # Utility functions for sentiment analysis
│ │ └── speech_to_text.py # Speech-to-text and sentiment analysis logic
│ ├── sentiment_analysis_bert_other.py # Sentiment analysis using BERT
│ └── sentiment_analysis.py # Sentiment analysis pipeline script
│ │ ├── transformer_components.py # Transformer model components
│ │ └── speech_to_text.py # Speech-to-text and sentiment analysis logic
│ ├── scripts/ # Scripts for dataset management and preprocessing
│ │ ├── __init__.py # Marks the directory as a Python package
│ │ ├── loading_kaggle_dataset_utils.py # Utilities for downloading and optimizing Kaggle datasets
│ │ ├── loading_kaggle_dataset_script.py # Script to process Kaggle datasets
│ │ └── README.md # Documentation for the scripts folder
│ ├── translation_french_english.py # English-to-French translation pipeline
│ ├── sentiment_analysis_bert_other.py # Sentiment analysis using BERT
│ └── sentiment_analysis.py # Sentiment analysis pipeline script
├── tests/ # Unit and integration tests
│ └── test_model.py # Tests for speech_to_text.py
@@ -78,44 +78,97 @@ Sentiment_Analysis/
├── Makefile # Makefile for common tasks
├── pyproject.toml # Poetry configuration file
├── README.md # Project documentation
├── requirements.txt # Optional: pip requirements file
└── ruff.toml # Ruff configuration file
```

---

## Usage

### Sentiment Analysis

1. **Prepare the Dataset**

Place your dataset in the `src/data/` folder. The default dataset used is `tripadvisor_hotel_reviews.csv`.
Place your dataset in the `src/data/` folder. The default dataset used is `tripadvisor_hotel_reviews.csv`.

2. **Train the Model**

Run the main script to train the model:
Run the main script to train the model:

```bash
poetry run python src/sentiment_analysis.py
```
```bash
poetry run python src/sentiment_analysis.py
```

The script will preprocess the data, train the model, and save it in the `src/models/` folder.
The script will preprocess the data, train the model, and save it in the `src/models/` folder.

3. **Inference**

The script includes a test example for inference. Modify the `raw_text_data` variable in `sentiment_analysis.py` to test with your own text input.
The script includes a test example for inference. Modify the `raw_text_data` variable in `sentiment_analysis.py` to test with your own text input.
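
   A hypothetical inference call, assuming the saved inference model bundles the `TextVectorization` layer so it accepts raw strings (a sketch, not the script's exact code):

   ```python
   import tensorflow as tf

   # Hypothetical sketch; the actual pipeline lives in src/sentiment_analysis.py.
   inference_model = tf.keras.models.load_model("src/models/inference_model.keras")
   raw_text_data = tf.constant(["The room was spotless and the staff were lovely."])
   probability = float(inference_model.predict(raw_text_data)[0][0])
   print("positive" if probability > 0.5 else "negative")
   ```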

4. **Evaluate the Model**

The script evaluates the model on the test dataset and prints the accuracy.
The script evaluates the model on the test dataset and prints the accuracy.

Example Output:
Output:

```
Test Acc.: 95.00%
```
```
Test Acc.: 95.00%
```

---

### English-to-French Translation

1. **Prepare the Dataset**

Place your English-French dataset in the `src/data/` folder. The dataset should be in a format compatible with the `DatasetProcessor` class.

2. **Train or Load the Model**

Run the translation script to train or load the Transformer model:

```bash
poetry run python src/translation_french_english.py
```

- If a saved model exists, it will be loaded.
- Otherwise, a new model will be trained and saved in the `src/models/` folder.
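
   In outline, the load-or-train logic looks like this (a sketch; `build_transformer`, `train_ds`, and `val_ds` are hypothetical names standing in for the script's own objects):

   ```python
   import os

   import tensorflow as tf

   MODEL_PATH = "src/models/transformer_best_model.keras"

   if os.path.exists(MODEL_PATH):
       # The custom layers are registered via
       # @tf.keras.utils.register_keras_serializable, so load_model can
       # rebuild them from their get_config() output.
       transformer = tf.keras.models.load_model(MODEL_PATH)
   else:
       transformer = build_transformer()  # hypothetical builder function
       transformer.fit(train_ds, validation_data=val_ds, epochs=30)
       transformer.save(MODEL_PATH)
   ```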

3. **Evaluate the Model**

The script evaluates the model on the test dataset and calculates the BLEU score (a call sketch follows the output below).

Output:

```
Test loss: 2.27, Test accuracy: 65.26%
BLEU score on the test dataset: 0.47
```
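
   The BLEU figure comes from the `evaluate_bleu` helper in `src/modules/transformer_components.py` (shown later in this diff). A call looks roughly like this, with `transformer`, `test_ds`, and `preprocessor` standing in for the script's own objects:

   ```python
   bleu = evaluate_bleu(transformer, test_ds, preprocessor)
   print(f"BLEU score on the test dataset: {bleu:.2f}")
   ```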

---

## Customization

- Modify hyperparameters like `embedding_dim`, `lstm_units`, and `dropout_rate` in `src/modules/model.py`.
- Replace the dataset in `src/data/` with your own CSV file.
- Modify hyperparameters like `embed_dim`, `dense_dim`, and `num_heads` in `src/translation_french_english.py` for the Transformer model (see the snippet below).
- Replace the dataset in `src/data/` with your own English-French dataset.
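
For example, the Transformer hyperparameters are ordinary Python variables you can edit (the values below are illustrative, not the script's defaults):

```python
embed_dim = 256   # size of each token embedding
dense_dim = 2048  # width of the per-position feed-forward block
num_heads = 8     # attention heads per MultiHeadAttention layer
```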

---

## License

File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion src/modules/sentiment_analysis_utils.py
@@ -1,4 +1,4 @@
from modules.model import ModelBuilder, ModelTrainer, OptunaOptimizer
from modules.model_sentiment_analysis import ModelBuilder, ModelTrainer, OptunaOptimizer
from modules.utils import ModelPaths, OptunaPaths
import os
import tensorflow as tf
153 changes: 135 additions & 18 deletions src/modules/transformer_components.py
@@ -1,66 +1,117 @@
import tensorflow as tf
import logging
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction


@tf.keras.utils.register_keras_serializable(package="Custom")
class PositionalEmbedding(tf.keras.layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim):
        super().__init__()
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        """
        Initialize the PositionalEmbedding layer.

        Args:
            sequence_length (int): Maximum sequence length.
            vocab_size (int): Vocabulary size.
            embed_dim (int): Embedding dimension.
            kwargs: Additional keyword arguments for the parent class.
        """
        super().__init__(**kwargs)
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def build(self, input_shape):
        self.token_embeddings = tf.keras.layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
            input_dim=self.vocab_size, output_dim=self.embed_dim
        )
        self.position_embeddings = tf.keras.layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
            input_dim=self.sequence_length, output_dim=self.embed_dim
        )
        self.sequence_length = sequence_length
        self.embed_dim = embed_dim
        super().build(input_shape)

    def call(self, inputs):
        # One position index per timestep, broadcast across the batch.
        positions = tf.range(start=0, limit=tf.shape(inputs)[-1], delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        # Summing token and position embeddings injects order information.
        return embedded_tokens + embedded_positions

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "sequence_length": self.sequence_length,
                "vocab_size": self.vocab_size,
                "embed_dim": self.embed_dim,
            }
        )
        return config


@tf.keras.utils.register_keras_serializable(package="Custom")
class TransformerEncoder(tf.keras.layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads):
        super().__init__()
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads

    def build(self, input_shape):
        self.attention = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
            num_heads=self.num_heads, key_dim=self.embed_dim
        )
        self.dense_proj = tf.keras.Sequential(
            [
                tf.keras.layers.Dense(dense_dim, activation="gelu"),
                tf.keras.layers.Dense(embed_dim),
                tf.keras.layers.Dense(self.dense_dim, activation="gelu"),
                tf.keras.layers.Dense(self.embed_dim),
            ]
        )
        self.layernorm_1 = tf.keras.layers.LayerNormalization()
        self.layernorm_2 = tf.keras.layers.LayerNormalization()
        super().build(input_shape)

    def call(self, inputs, mask=None):
        attention_output = self.attention(inputs, inputs, attention_mask=mask)
        # Residual connection around attention, then layer normalization.
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        # Second residual connection around the feed-forward block.
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "dense_dim": self.dense_dim,
                "num_heads": self.num_heads,
            }
        )
        return config


@tf.keras.utils.register_keras_serializable(package="Custom")
class TransformerDecoder(tf.keras.layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads):
        super().__init__()
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads

    def build(self, input_shape):
        self.attention_1 = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
            num_heads=self.num_heads, key_dim=self.embed_dim
        )
        self.attention_2 = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
            num_heads=self.num_heads, key_dim=self.embed_dim
        )
        self.dense_proj = tf.keras.Sequential(
            [
                tf.keras.layers.Dense(dense_dim, activation="gelu"),
                tf.keras.layers.Dense(embed_dim),
                tf.keras.layers.Dense(self.dense_dim, activation="gelu"),
                tf.keras.layers.Dense(self.embed_dim),
            ]
        )
        self.layernorm_1 = tf.keras.layers.LayerNormalization()
        self.layernorm_2 = tf.keras.layers.LayerNormalization()
        self.layernorm_3 = tf.keras.layers.LayerNormalization()
        self.supports_masking = True
        super().build(input_shape)

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
@@ -95,3 +95,69 @@ def get_causal_attention_mask(self, inputs):
            axis=0,
        )
        return tf.tile(mask, mult)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "dense_dim": self.dense_dim,
                "num_heads": self.num_heads,
            }
        )
        return config


def evaluate_bleu(model, dataset, preprocessor):
    """
    Evaluate the BLEU score for the model on the given dataset.

    Args:
        model (tf.keras.Model): The trained Transformer model.
        dataset (tf.data.Dataset): The dataset to evaluate.
        preprocessor (TextPreprocessor): The text preprocessor for decoding.

    Returns:
        float: The BLEU score for the dataset.
    """
    logging.info("Starting BLEU score evaluation.")
    references = []
    candidates = []
    smoothing_function = SmoothingFunction().method1

    # Get the vocabulary from the target vectorization layer
    vocab = preprocessor.target_vectorization.get_vocabulary()
    index_to_word = {i: word for i, word in enumerate(vocab)}

    for batch in dataset:
        inputs, targets = batch
        # Generate predictions
        predictions = model.predict(inputs, verbose=0)

        # Decode predictions and targets
        for i in range(len(predictions)):
            # Decode the predicted sentence as a token list: corpus_bleu
            # expects tokenized sequences, not raw strings.
            pred_tokens = predictions[i].argmax(axis=-1)  # Get token IDs
            pred_words = [
                index_to_word[token] for token in pred_tokens if token != 0
            ]  # Ignore padding tokens

            # Decode the reference sentence the same way
            ref_tokens = targets[i].numpy()  # Get token IDs
            ref_words = [
                index_to_word[token] for token in ref_tokens if token != 0
            ]  # Ignore padding tokens

            candidates.append(pred_words)
            references.append([ref_words])

    # Calculate BLEU score
    bleu_score = corpus_bleu(
        references, candidates, smoothing_function=smoothing_function
    )
    logging.info(f"BLEU score evaluation completed: {bleu_score:.4f}")
    return bleu_score
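

# Note: corpus_bleu expects tokenized sequences (lists of tokens), not raw
# strings, hence the token lists collected above. A quick sanity check
# (illustrative, not part of the module):
#
#     refs = [[["the", "cat", "sat", "on", "the", "mat"]]]
#     cands = [["the", "cat", "sat", "on", "the", "mat"]]
#     corpus_bleu(refs, cands)  # 1.0 for an exact match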