- Introduction
- Group members
- About the dataset
- Process pipeline
- Model deployment
- Model evaluation
- Discussion
In this assignment for the Machine Learning course, we are required to define and solve a problem on a single dataset by applying models from Chapters 2-10.
These models include:
- Decision Tree (Chapter 2)
- Artificial Neural Network (Chapter 3)
- Naive Bayes (Chapter 4)
- Genetic Algorithm (Chapter 5)
- Graphical Models (Chapter 6)
- Support Vector Machine (Chapter 7)
- Dimensionality Reduction (Chapter 8)
- Ensemble Method (Chapter 9)
- Differential Model (Chapter 10)
| Member name | ID | Assignment 1 | Assignment 2 |
|---|---|---|---|
| Tran Quoc Hieu | 2252217 | Genetic Algorithm | Logistic Regression |
| Tran Quoc Trung | 2252859 | Naive Bayes | Principal Component Analysis |
| Nguyen Anh Khoa | 2252352 | Graphical Models (Manual Naive Bayes) | Ensemble Method |
| Do Quang Hao | 2252180 | Artificial Neural Network | Support Vector Machine |
| Luu Chi Cuong | 2252097 | Decision Tree | Hidden Markov Model |
- In this assignment, we use a dataset for a sentiment analysis task.
- Sentiment analysis is the task of classifying the sentiment (positive, neutral, negative) of a given document (a list of sentences).
- The dataset includes training data with 27,480 examples and test data with 3,534 examples, each with 10 columns. The summary of the training set is as follows:
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | textID | 27480 non-null | object |
| 1 | text | 27480 non-null | object |
| 2 | selected_text | 27480 non-null | object |
| 3 | sentiment | 27480 non-null | object |
| 4 | Time of Tweet | 27480 non-null | object |
| 5 | Age of User | 27480 non-null | object |
| 6 | Country | 27480 non-null | object |
| 7 | Population -2020 | 27480 non-null | float64 |
| 8 | Land Area (Km²) | 27480 non-null | float64 |
| 9 | Density (P/Km²) | 27480 non-null | float64 |
- For the sentiment classification task, we focus on only 2 columns: `text` (data) and `sentiment` (label). The distribution of each class in the dataset is as follows:
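The class distribution can be inspected with pandas, for example as follows (the in-memory DataFrame below is a toy stand-in for the real training set):

```python
import pandas as pd

# Toy stand-in for the training set; the real data has 27,480 rows
# with the columns listed in the summary above.
df = pd.DataFrame({
    "text": ["great day", "so bad", "it is ok", "love it"],
    "sentiment": ["positive", "negative", "neutral", "positive"],
})

# Class distribution of the label column.
counts = df["sentiment"].value_counts()
print(counts)
```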
For each model, we follow the pipeline below:

- Data preprocessing: this step includes cleaning the text and extracting features from it. In our assignment, we implement 2 kinds of features: Tf-idf and word embedding. We show the characteristics of each feature extraction approach later.
- Model tuning: we utilize Optuna, a hyperparameter optimization framework. The model is tuned on the training set using cross-validation. An abstract class for hyperparameter tuning is set up in a separate directory.
- Model training: after finding the optimal hyperparameters, the model is fitted on the training set.
- Model evaluation: we use the following evaluation metrics:
  - Accuracy: measures the percentage of correctly predicted labels out of all predictions.
  - F1-score: a weighted average of precision and recall, useful for imbalanced datasets.
  - AUC-ROC: the Area Under the Receiver Operating Characteristic Curve, which evaluates the model's ability to distinguish between classes.
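The preprocessing, training, and evaluation steps above can be sketched with scikit-learn as follows. The toy corpus and the choice of Logistic Regression are illustrative assumptions; each notebook uses its own model and the real dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Toy corpus standing in for the cleaned tweet texts (assumption).
texts = ["great movie", "terrible plot", "just fine i guess",
         "loved every minute", "awful boring acting", "it was ok"]
labels = ["positive", "negative", "neutral",
          "positive", "negative", "neutral"]

# Data preprocessing: lowercase the text and extract Tf-idf features.
vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(texts)

# Model training: fit a classifier on the training features.
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Model evaluation: accuracy and weighted F1 (computed on the
# training data here, purely for illustration).
pred = clf.predict(X)
print("accuracy:", accuracy_score(labels, pred))
print("f1:", f1_score(labels, pred, average="weighted"))
```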
For ease of observation, each model is implemented in a Jupyter notebook in a separate directory inside src/models, as follows:

- Decision Tree (here)
- Artificial Neural Network (here)
- Naive Bayes (here)
- Genetic Algorithm (here)
- Graphical Models (here)
For the second assignment, we continue to implement models from Chapters 6-10. Details about each model implementation are given in the README file at the corresponding path:

- Graphical Model - Hidden Markov Model (here)
- Support Vector Machine (here)
- Dimensionality Reduction - Principal Component Analysis (here)
- Ensemble Method - Boosting (here)
- Differential Model - Logistic Regression (here)
The table below shows the results on the test set for each model: the input feature (Tf-idf or word embedding) with its size, accuracy, F1-score, and the per-class ROC-AUC (negative, neutral, positive).

| Model | Feature | Size | Accuracy | F1-score | ROC-AUC (Negative) | ROC-AUC (Neutral) | ROC-AUC (Positive) |
|---|---|---|---|---|---|---|---|
| Decision Tree (DT) | Tf-idf | 25,828 | 0.6919 | 0.6915 | 0.8297 | 0.7743 | 0.8682 |
| Multi-layer Perceptron (MLP) | Tf-idf | 50,000 | 0.6984 | 0.6984 | 0.8672 | 0.7967 | 0.8987 |
| Naive Bayes (NB) | Tf-idf | 1,000 | 0.5068 | 0.5224 | 0.7252 | 0.6844 | 0.8002 |
| Naive Bayes (NB) | Tf-idf | 485 (60% variance after PCA) | 0.4052 | 0.4052 | 0.6068 | 0.5418 | 0.6212 |
| Naive Bayes (NB) | Word embedding | 300 | 0.5351 | 0.5332 | 0.7427 | 0.6572 | 0.7719 |
| Naive Bayes (NB) | Word embedding | 30 (60% variance after PCA) | 0.5976 | 0.5969 | 0.8041 | 0.7055 | 0.8067 |
| Genetic Algorithm (GA) | Word embedding | 30 (60% variance after PCA) | 0.4046 | 0.2331 | 0.5075 | 0.4082 | 0.3980 |
| Hidden Markov Model (HMM) | Token sequence | — | 0.4434 | 0.4372 | 0.4836 | 0.5388 | 0.4763 |
| Support Vector Machine (SVM) | Tf-idf | 2,878 (90% variance after PCA) | 0.6488 | 0.6509 | 0.8299 | 0.7309 | 0.8549 |
| Ensemble Method (EM) | Word embedding | 30 (60% variance after PCA) | 0.6178 | 0.6203 | 0.8044 | 0.7404 | 0.8350 |
| Logistic Regression (LR) | Tf-idf | 50,000 | 0.6811 | 0.6822 | 0.8638 | 0.7695 | 0.8955 |
| Logistic Regression (LR) | Word embedding | 30 (60% variance after PCA) | 0.6347 | 0.6315 | 0.8130 | 0.7378 | 0.8149 |
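The per-class ROC-AUC columns can be computed by binarizing the labels one-vs-rest and scoring each class against the predicted probabilities. The values below use toy predictions, not the real test set:

```python
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score

classes = ["negative", "neutral", "positive"]
y_true = ["negative", "neutral", "positive", "positive", "negative", "neutral"]
# Toy predicted class probabilities, columns in the same order as `classes`.
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
])

# One column of 0/1 indicators per class (one-vs-rest).
y_bin = label_binarize(y_true, classes=classes)
for i, c in enumerate(classes):
    print(c, roc_auc_score(y_bin[:, i], y_prob[:, i]))
```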
In the dimensionality reduction section, we do not implement a specific model. Instead, we introduce a reduced input feature, which is then used by the other models and compared with the original one.
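A minimal sketch of this reduction: scikit-learn's `PCA` accepts a float `n_components`, keeping the smallest number of components that explains at least that fraction of variance (the random matrix below stands in for a dense feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # stand-in for a dense feature matrix

# A float in (0, 1) asks PCA for the smallest number of components
# whose cumulative explained variance reaches that ratio.
pca = PCA(n_components=0.60)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```

Note that for a sparse Tf-idf matrix, a dense conversion (or `TruncatedSVD`, which accepts sparse input) is the usual route, since classic `PCA` centers the data.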
a. Input feature:

- In our project, we extract and use 2 different feature classes: Tf-idf and word embedding. For each model, we choose the suitable input feature approach, since large input features are not viable for some models due to hardware limitations. For example, EM and GA both involve creating many models, which would cost a lot of space if we used the large Tf-idf feature.
- For HMM, the input feature is a sequence of tokens, which differs from the others because of the model's behaviour: it works on the token sequence and determines the most likely sequence of states that generated it.
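The word-embedding feature mentioned above can be obtained by averaging per-token vectors into one fixed-size document vector. The 4-dimensional toy lookup table below is an assumption standing in for the 300-dimensional embeddings in the results table:

```python
import numpy as np

# Toy embedding lookup table (assumption); real embeddings are 300-dim.
embeddings = {
    "good": np.array([0.9, 0.1, 0.0, 0.2]),
    "bad":  np.array([-0.8, 0.2, 0.1, 0.0]),
    "day":  np.array([0.1, 0.5, 0.3, 0.1]),
}

def embed(tokens):
    """Average the vectors of known tokens into one document vector."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    # Fall back to a zero vector when no token is in the vocabulary.
    return np.mean(vecs, axis=0) if vecs else np.zeros(4)

print(embed(["good", "day"]))
```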
b. Size-reduction trade-off: in NB and LR, we implement the feature extraction both with and without PCA reduction and analyze the results. For NB, surprisingly, we observe 2 behaviors: Tf-idf is more effective without PCA, while word embedding gives better results with PCA applied. In LR, the full Tf-idf (50,000 features) and the reduced word embedding (30 features) are implemented. The result is reasonable: the full Tf-idf performs slightly better than the reduced word embedding.
c. Hyperparameter tuning: during implementation, we set aside part of the data for model validation and hyperparameter tuning. For example, we tune the learning rate and number of hidden nodes for the MLP, and the number of estimators and maximum tree depth for the EM. Tuning shows a non-trivial improvement for some models. For others (HMM, GA), hardware limitations prevented proper tuning, leading to low results.
d. Result comparison:

- Among these models, unsurprisingly, the MLP with 1 hidden layer and the 50,000-dimensional Tf-idf input gives the best result, since it has the largest number of parameters. However, as we describe in the corresponding notebook, this model is not stable: it depends strongly on the initialization point and easily overfits the dataset.
- LR and DT also perform well compared with the others. These are the models for which we can utilize the full Tf-idf feature and tune properly, hence the better results. Also, compared to the MLP, these models suffer less from overfitting since they have fewer parameters.
- On the other hand, SVM and EM, which use the reduced input, give acceptable results. When implementing them, we observed that using the full input feature is costly in both time and space; hence the PCA-reduced feature is applied, which yields a reasonable result.
- Finally, GA and HMM have the lowest performance. GA has the lowest F1-score since it predicts only one label for every sample; it also shows an extremely weak ability to distinguish between the classes, as suggested by its AUC-ROC. Besides its poor performance, it is space-expensive, since it involves creating many neural network models. For HMM, we suffer from data scarcity: this model requires splitting the data and building a separate HMM for each class, so each HMM is likely underfitted, leading to a poor result.
