114 changes: 85 additions & 29 deletions README.md
@@ -4,20 +4,29 @@
[![Python](https://img.shields.io/badge/Python-3.8%2B-blue)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

# Sentiment Analysis

This repository contains a sentiment analysis application that uses TensorFlow and Keras to classify text data into positive or negative sentiments. The application includes a speech-to-text interface (`voice_to_text_app.py`) built with Dash, which allows users to record audio, transcribe it into text, and analyze its sentiment.
# Sentiment Analysis and Translation

This repository contains a sentiment analysis application and an English-to-French translation model. The sentiment analysis application uses TensorFlow and Keras to classify text as expressing positive or negative sentiment. The translation model implements a Transformer-based encoder-decoder architecture for sequence-to-sequence learning.

## Features

### Sentiment Analysis
- **Speech-to-Text**: Converts spoken audio into text using the Vosk library.
- **Text Preprocessing**: Uses TensorFlow's `TextVectorization` layer to tokenize and vectorize text data.
- **Bidirectional LSTM Model**: Implements a deep learning model with embedding, bidirectional LSTM, and dense layers for sentiment classification (a minimal sketch follows this list).
- **Training and Evaluation**: Includes functionality to train the model on a dataset and evaluate its performance on validation and test sets.
- **Inference**: Provides an inference pipeline to predict sentiment for new text inputs.
- **Interactive Application**: A Dash-based web application for real-time speech-to-text and sentiment analysis.
- **Translation Dataset Support**: Processes English-French translation datasets for sequence-to-sequence learning.
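
A minimal sketch of the kind of model this pipeline builds, with illustrative sizes (`max_tokens`, `embedding_dim`, and `lstm_units` are assumptions, not the project's tuned values; those live in `src/configurations/` and `src/models/optuna_model_binary.json`):

```python
import tensorflow as tf

# Illustrative hyperparameters; the tuned values come from the Optuna study.
max_tokens, sequence_length = 20_000, 100
embedding_dim, lstm_units = 128, 64

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=max_tokens, output_sequence_length=sequence_length
)
# vectorizer.adapt(train_texts)  # fit the vocabulary on raw training text first

model = tf.keras.Sequential([
    vectorizer,
    tf.keras.layers.Embedding(max_tokens, embedding_dim, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_units)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of "positive"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```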

### English-to-French Translation
- **Transformer Model**: Implements a sequence-to-sequence Transformer model for English-to-French translation (see the composition sketch after this list).
- **BLEU Score Evaluation**: Evaluates the quality of translations using the BLEU metric.
- **Preprocessing**: Includes utilities for tokenizing and vectorizing English and French text.
- **Model Saving and Loading**: Supports saving and loading trained Transformer models for reuse.
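
These components compose into the full sequence-to-sequence model along the following lines (a sketch: the vocabulary size, sequence length, dimensions, and compile settings are illustrative, and the actual wiring lives in `src/translation_french_english.py`):

```python
import tensorflow as tf
from modules.transformer_components import (
    PositionalEmbedding,
    TransformerDecoder,
    TransformerEncoder,
)

# Illustrative sizes only.
vocab_size, sequence_length = 15_000, 20
embed_dim, dense_dim, num_heads = 256, 2048, 8

encoder_inputs = tf.keras.Input(shape=(None,), dtype="int64", name="english")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)

decoder_inputs = tf.keras.Input(shape=(None,), dtype="int64", name="french")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)
decoder_outputs = tf.keras.layers.Dense(vocab_size, activation="softmax")(x)

transformer = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
transformer.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```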

---

## Installation

### Install Dependencies

@@ -33,6 +33,8 @@ Then, install the project dependencies:
poetry install
```

---

## Project Structure

```
@@ -46,22 +46,30 @@ Sentiment_Analysis/
│ ├── models/ # Saved models
│ │ ├── inference_model.keras
│ │ ├── sentiment_keras_binary.keras
│ │ ├── transformer_best_model.keras
│ │ └── optuna_model_binary.json # Hyperparameter optimization results
│ ├── configurations/ # Configuration files
│ │ ├── model_builder_config.json
│ │ ├── model_trainer_config.json
│ │ └── optuna_config.json
│ ├── modules/ # Custom Python modules
│ │ ├── __init__.py # Makes the folder a Python package
│ │ ├── load_data.py # Data loading utilities
│ │ ├── model.py # Model definition and training
│ │ ├── data_preprocess.py # Data preprocessing utilities
│ │ ├── text_vectorizer.py # Text vectorization utilities
│ │ ├── utils.py # Enum classes
│ ├── modules/ # Custom Python modules
│ │ ├── __init__.py # Makes the folder a Python package
│ │ ├── load_data.py # Data loading utilities
│ │ ├── model.py # Model definition and training
│ │ ├── data_preprocess.py # Data preprocessing utilities
│ │ ├── text_vectorizer.py # Text vectorization utilities
│ │ ├── utils.py # Enum classes
│ │ ├── sentiment_analysis_utils.py # Utility functions for sentiment analysis
│ │ └── speech_to_text.py # Speech-to-text and sentiment analysis logic
│ ├── sentiment_analysis_bert_other.py # Sentiment analysis using BERT
│ └── sentiment_analysis.py # Sentiment analysis pipeline script
│ │ ├── transformer_components.py # Transformer model components
│ │ └── speech_to_text.py # Speech-to-text and sentiment analysis logic
│ ├── scripts/ # Scripts for dataset management and preprocessing
│ │ ├── __init__.py # Marks the directory as a Python package
│ │ ├── loading_kaggle_dataset_utils.py # Utilities for downloading and optimizing Kaggle datasets
│ │ ├── loading_kaggle_dataset_script.py # Script to process Kaggle datasets
│ │ └── README.md # Documentation for the scripts folder
│ ├── translation_french_english.py # English-to-French translation pipeline
│ ├── sentiment_analysis_bert_other.py # Sentiment analysis using BERT
│ └── sentiment_analysis.py # Sentiment analysis pipeline script
├── tests/ # Unit and integration tests
│ └── test_model.py # Tests for speech_to_text.py
@@ -78,44 +78,97 @@ Sentiment_Analysis/
├── Makefile # Makefile for common tasks
├── pyproject.toml # Poetry configuration file
├── README.md # Project documentation
├── requirements.txt # Optional: pip requirements file
└── ruff.toml # Ruff configuration file
```

---

## Usage

### Sentiment Analysis

1. **Prepare the Dataset**

Place your dataset in the `src/data/` folder. The default dataset used is `tripadvisor_hotel_reviews.csv`.
Place your dataset in the `src/data/` folder. The default dataset used is `tripadvisor_hotel_reviews.csv`.

2. **Train the Model**

Run the main script to train the model:
Run the main script to train the model:

```bash
poetry run python src/sentiment_analysis.py
```
```bash
poetry run python src/sentiment_analysis.py
```

The script will preprocess the data, train the model, and save it in the `src/models/` folder.
The script will preprocess the data, train the model, and save it in the `src/models/` folder.

3. **Inference**

The script includes a test example for inference. Modify the `raw_text_data` variable in `sentiment_analysis.py` to test with your own text input.
The script includes a test example for inference. Modify the `raw_text_data` variable in `sentiment_analysis.py` to test with your own text input.
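
   A hypothetical inference call, assuming the saved inference model bundles the `TextVectorization` layer so it accepts raw strings (a sketch, not the script's exact code):

   ```python
   import tensorflow as tf

   # Hypothetical sketch; the actual pipeline lives in src/sentiment_analysis.py.
   inference_model = tf.keras.models.load_model("src/models/inference_model.keras")
   raw_text_data = tf.constant(["The room was spotless and the staff were lovely."])
   probability = float(inference_model.predict(raw_text_data)[0][0])
   print("positive" if probability > 0.5 else "negative")
   ```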

4. **Evaluate the Model**

The script evaluates the model on the test dataset and prints the accuracy.
The script evaluates the model on the test dataset and prints the accuracy.

Example Output:
Output:

```
Test Acc.: 95.00%
```
```
Test Acc.: 95.00%
```

---

### English-to-French Translation

1. **Prepare the Dataset**

Place your English-French dataset in the `src/data/` folder. The dataset should be in a format compatible with the `DatasetProcessor` class.

2. **Train or Load the Model**

Run the translation script to train or load the Transformer model:

```bash
poetry run python src/translation_french_english.py
```

- If a saved model exists, it will be loaded.
- Otherwise, a new model will be trained and saved in the `src/models/` folder.
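
   In outline, the load-or-train logic looks like this (a sketch; `build_transformer`, `train_ds`, and `val_ds` are hypothetical names standing in for the script's own objects):

   ```python
   import os

   import tensorflow as tf

   MODEL_PATH = "src/models/transformer_best_model.keras"

   if os.path.exists(MODEL_PATH):
       # The custom layers are registered via
       # @tf.keras.utils.register_keras_serializable, so load_model can
       # rebuild them from their get_config() output.
       transformer = tf.keras.models.load_model(MODEL_PATH)
   else:
       transformer = build_transformer()  # hypothetical builder function
       transformer.fit(train_ds, validation_data=val_ds, epochs=30)
       transformer.save(MODEL_PATH)
   ```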

3. **Evaluate the Model**

The script evaluates the model on the test dataset and calculates the BLEU score (a call sketch follows the output below).

Output:

```
Test loss: 2.27, Test accuracy: 65.26%
BLEU score on the test dataset: 0.47
```
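
   The BLEU figure comes from the `evaluate_bleu` helper in `src/modules/transformer_components.py` (shown later in this diff). A call looks roughly like this, with `transformer`, `test_ds`, and `preprocessor` standing in for the script's own objects:

   ```python
   bleu = evaluate_bleu(transformer, test_ds, preprocessor)
   print(f"BLEU score on the test dataset: {bleu:.2f}")
   ```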

---

## Customization

- Modify hyperparameters like `embedding_dim`, `lstm_units`, and `dropout_rate` in `src/modules/model.py`.
- Replace the dataset in `src/data/` with your own CSV file.
- Modify hyperparameters like `embed_dim`, `dense_dim`, and `num_heads` in `src/translation_french_english.py` for the Transformer model (see the snippet below).
- Replace the dataset in `src/data/` with your own English-French dataset.
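
For example, the Transformer hyperparameters are ordinary Python variables you can edit (the values below are illustrative, not the script's defaults):

```python
embed_dim = 256   # size of each token embedding
dense_dim = 2048  # width of the per-position feed-forward block
num_heads = 8     # attention heads per MultiHeadAttention layer
```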

---

## License

File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion src/modules/sentiment_analysis_utils.py
@@ -1,4 +1,4 @@
from modules.model import ModelBuilder, ModelTrainer, OptunaOptimizer
from modules.model_sentiment_analysis import ModelBuilder, ModelTrainer, OptunaOptimizer
from modules.utils import ModelPaths, OptunaPaths
import os
import tensorflow as tf
153 changes: 135 additions & 18 deletions src/modules/transformer_components.py
@@ -1,66 +1,117 @@
import tensorflow as tf
import logging
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction


@tf.keras.utils.register_keras_serializable(package="Custom")
class PositionalEmbedding(tf.keras.layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim):
        super().__init__()
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        """
        Initialize the PositionalEmbedding layer.

        Args:
            sequence_length (int): Maximum sequence length.
            vocab_size (int): Vocabulary size.
            embed_dim (int): Embedding dimension.
            kwargs: Additional keyword arguments for the parent class.
        """
        super().__init__(**kwargs)
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def build(self, input_shape):
        self.token_embeddings = tf.keras.layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
            input_dim=self.vocab_size, output_dim=self.embed_dim
        )
        self.position_embeddings = tf.keras.layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
            input_dim=self.sequence_length, output_dim=self.embed_dim
        )
        self.sequence_length = sequence_length
        self.embed_dim = embed_dim
        super().build(input_shape)

    def call(self, inputs):
        # One position index per timestep, broadcast across the batch.
        positions = tf.range(start=0, limit=tf.shape(inputs)[-1], delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        # Summing token and position embeddings injects order information.
        return embedded_tokens + embedded_positions

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "sequence_length": self.sequence_length,
                "vocab_size": self.vocab_size,
                "embed_dim": self.embed_dim,
            }
        )
        return config


@tf.keras.utils.register_keras_serializable(package="Custom")
class TransformerEncoder(tf.keras.layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads):
        super().__init__()
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads

    def build(self, input_shape):
        self.attention = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
            num_heads=self.num_heads, key_dim=self.embed_dim
        )
        self.dense_proj = tf.keras.Sequential(
            [
                tf.keras.layers.Dense(dense_dim, activation="gelu"),
                tf.keras.layers.Dense(embed_dim),
                tf.keras.layers.Dense(self.dense_dim, activation="gelu"),
                tf.keras.layers.Dense(self.embed_dim),
            ]
        )
        self.layernorm_1 = tf.keras.layers.LayerNormalization()
        self.layernorm_2 = tf.keras.layers.LayerNormalization()
        super().build(input_shape)

    def call(self, inputs, mask=None):
        attention_output = self.attention(inputs, inputs, attention_mask=mask)
        # Residual connection around attention, then layer normalization.
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        # Second residual connection around the feed-forward block.
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "dense_dim": self.dense_dim,
                "num_heads": self.num_heads,
            }
        )
        return config


@tf.keras.utils.register_keras_serializable(package="Custom")
class TransformerDecoder(tf.keras.layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads):
        super().__init__()
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads

    def build(self, input_shape):
        self.attention_1 = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
            num_heads=self.num_heads, key_dim=self.embed_dim
        )
        self.attention_2 = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
            num_heads=self.num_heads, key_dim=self.embed_dim
        )
        self.dense_proj = tf.keras.Sequential(
            [
                tf.keras.layers.Dense(dense_dim, activation="gelu"),
                tf.keras.layers.Dense(embed_dim),
                tf.keras.layers.Dense(self.dense_dim, activation="gelu"),
                tf.keras.layers.Dense(self.embed_dim),
            ]
        )
        self.layernorm_1 = tf.keras.layers.LayerNormalization()
        self.layernorm_2 = tf.keras.layers.LayerNormalization()
        self.layernorm_3 = tf.keras.layers.LayerNormalization()
        self.supports_masking = True
        super().build(input_shape)

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
@@ -95,3 +95,69 @@ def get_causal_attention_mask(self, inputs):
            axis=0,
        )
        return tf.tile(mask, mult)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "dense_dim": self.dense_dim,
                "num_heads": self.num_heads,
            }
        )
        return config


def evaluate_bleu(model, dataset, preprocessor):
    """
    Evaluate the BLEU score for the model on the given dataset.

    Args:
        model (tf.keras.Model): The trained Transformer model.
        dataset (tf.data.Dataset): The dataset to evaluate.
        preprocessor (TextPreprocessor): The text preprocessor for decoding.

    Returns:
        float: The BLEU score for the dataset.
    """
    logging.info("Starting BLEU score evaluation.")
    references = []
    candidates = []
    smoothing_function = SmoothingFunction().method1

    # Get the vocabulary from the target vectorization layer
    vocab = preprocessor.target_vectorization.get_vocabulary()
    index_to_word = {i: word for i, word in enumerate(vocab)}

    for batch in dataset:
        inputs, targets = batch
        # Generate predictions
        predictions = model.predict(inputs, verbose=0)

        # Decode predictions and targets
        for i in range(len(predictions)):
            # Decode the predicted sentence as a token list: corpus_bleu
            # expects tokenized sequences, not raw strings.
            pred_tokens = predictions[i].argmax(axis=-1)  # Get token IDs
            pred_words = [
                index_to_word[token] for token in pred_tokens if token != 0
            ]  # Ignore padding tokens

            # Decode the reference sentence the same way
            ref_tokens = targets[i].numpy()  # Get token IDs
            ref_words = [
                index_to_word[token] for token in ref_tokens if token != 0
            ]  # Ignore padding tokens

            candidates.append(pred_words)
            references.append([ref_words])

    # Calculate BLEU score
    bleu_score = corpus_bleu(
        references, candidates, smoothing_function=smoothing_function
    )
    logging.info(f"BLEU score evaluation completed: {bleu_score:.4f}")
    return bleu_score
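

# Note: corpus_bleu expects tokenized sequences (lists of tokens), not raw
# strings, hence the token lists collected above. A quick sanity check
# (illustrative, not part of the module):
#
#     refs = [[["the", "cat", "sat", "on", "the", "mat"]]]
#     cands = [["the", "cat", "sat", "on", "the", "mat"]]
#     corpus_bleu(refs, cands)  # 1.0 for an exact match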