- Introduction
- Group members
- About the dataset
- Process pipeline
- Model deployment
- Model evaluation
- Discussion
In this assignment for the Machine Learning course, we are required to define and solve a problem on a single dataset by applying models from Chapters 2-10.
These models include:
- Decision Tree (Chapter 2)
- Artificial Neural Network (Chapter 3)
- Naive Bayes (Chapter 4)
- Genetic Algorithm (Chapter 5)
- Graphical Models (Chapter 6)
- Support Vector Machine (Chapter 7)
- Dimensionality Reduction (Chapter 8)
- Ensemble Method (Chapter 9)
- Differential Model (Chapter 10)
| Member name | ID | Assignment 1 | Assignment 2 |
|---|---|---|---|
| Tran Quoc Hieu | 2252217 | Genetic Algorithm | Logistic Regression |
| Tran Quoc Trung | 2252859 | Naive Bayes | Principal Component Analysis |
| Nguyen Anh Khoa | 2252352 | Graphical Models (Manual Naive Bayes) | Ensemble Method |
| Do Quang Hao | 2252180 | Artificial Neural Network | Support Vector Machine |
| Luu Chi Cuong | 2252097 | Decision Tree | Hidden Markov Model |
- In this assignment, we use a dataset for a sentiment analysis task.
- Sentiment analysis is the task of classifying the sentiment (positive, neutral, negative) of a given document (a list of sentences).
- The dataset includes training data with 27,480 examples and test data with 3,534 examples, each with 10 columns. The summary of the training set is as follows:
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | textID | 27480 non-null | object |
| 1 | text | 27480 non-null | object |
| 2 | selected_text | 27480 non-null | object |
| 3 | sentiment | 27480 non-null | object |
| 4 | Time of Tweet | 27480 non-null | object |
| 5 | Age of User | 27480 non-null | object |
| 6 | Country | 27480 non-null | object |
| 7 | Population -2020 | 27480 non-null | float64 |
| 8 | Land Area (Km²) | 27480 non-null | float64 |
| 9 | Density (P/Km²) | 27480 non-null | float64 |
- For the sentiment classification task, we focus on only 2 columns: `text` (data) and `sentiment` (label). The distribution of each class in the dataset is as follows:
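The class distribution can be inspected with pandas, for example as follows (the in-memory DataFrame below is a toy stand-in for the real training set):

```python
import pandas as pd

# Toy stand-in for the training set; the real data has 27,480 rows
# with the columns listed in the summary above.
df = pd.DataFrame({
    "text": ["great day", "so bad", "it is ok", "love it"],
    "sentiment": ["positive", "negative", "neutral", "positive"],
})

# Class distribution of the label column.
counts = df["sentiment"].value_counts()
print(counts)
```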
For each model, we follow the pipeline below:

- Data preprocessing: this step includes cleaning the text and extracting features from it. In our assignment, we implement 2 kinds of features: Tf-idf and word embedding. We show the characteristics of each feature extraction approach later.
- Model tuning: we utilize Optuna, a hyperparameter optimization framework. The model is tuned on the training set using cross-validation. An abstract class for hyperparameter tuning is set up in a separate directory.
- Model training: after finding the optimal hyperparameters, the model is fitted on the training set.
- Model evaluation: we use the following evaluation metrics:
  - Accuracy: measures the percentage of correctly predicted labels out of all predictions.
  - F1-score: a weighted average of precision and recall, useful for imbalanced datasets.
  - AUC-ROC: the Area Under the Receiver Operating Characteristic Curve, which evaluates the model's ability to distinguish between classes.
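The preprocessing, training, and evaluation steps above can be sketched with scikit-learn as follows. The toy corpus and the choice of Logistic Regression are illustrative assumptions; each notebook uses its own model and the real dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Toy corpus standing in for the cleaned tweet texts (assumption).
texts = ["great movie", "terrible plot", "just fine i guess",
         "loved every minute", "awful boring acting", "it was ok"]
labels = ["positive", "negative", "neutral",
          "positive", "negative", "neutral"]

# Data preprocessing: lowercase the text and extract Tf-idf features.
vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(texts)

# Model training: fit a classifier on the training features.
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Model evaluation: accuracy and weighted F1 (computed on the
# training data here, purely for illustration).
pred = clf.predict(X)
print("accuracy:", accuracy_score(labels, pred))
print("f1:", f1_score(labels, pred, average="weighted"))
```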
For ease of observation, each model is implemented in a Jupyter notebook in a separate directory inside src/models, as follows:

- Decision Tree (here)
- Artificial Neural Network (here)
- Naive Bayes (here)
- Genetic Algorithm (here)
- Graphical Models (here)
For the second assignment, we continue to implement models from Chapters 6-10. Details about each model implementation are given in the README file at the corresponding path:

- Graphical Model - Hidden Markov Model (here)
- Support Vector Machine (here)
- Dimensionality Reduction - Principal Component Analysis (here)
- Ensemble Method - Boosting (here)
- Differential Model - Logistic Regression (here)
The table below shows the results on the test set for each model: the input feature (Tf-idf or word embedding) with its size, accuracy, F1-score, and the per-class ROC-AUC (negative, neutral, positive).

| Model | Feature | Size | Accuracy | F1-score | ROC-AUC (Negative) | ROC-AUC (Neutral) | ROC-AUC (Positive) |
|---|---|---|---|---|---|---|---|
| Decision Tree (DT) | Tf-idf | 25,828 | 0.6919 | 0.6915 | 0.8297 | 0.7743 | 0.8682 |
| Multi-layer Perceptron (MLP) | Tf-idf | 50,000 | 0.6984 | 0.6984 | 0.8672 | 0.7967 | 0.8987 |
| Naive Bayes (NB) | Tf-idf | 1,000 | 0.5068 | 0.5224 | 0.7252 | 0.6844 | 0.8002 |
| Naive Bayes (NB) | Tf-idf | 485 (60% variance after PCA) | 0.4052 | 0.4052 | 0.6068 | 0.5418 | 0.6212 |
| Naive Bayes (NB) | Word embedding | 300 | 0.5351 | 0.5332 | 0.7427 | 0.6572 | 0.7719 |
| Naive Bayes (NB) | Word embedding | 30 (60% variance after PCA) | 0.5976 | 0.5969 | 0.8041 | 0.7055 | 0.8067 |
| Genetic Algorithm (GA) | Word embedding | 30 (60% variance after PCA) | 0.4046 | 0.2331 | 0.5075 | 0.4082 | 0.3980 |
| Hidden Markov Model (HMM) | Token sequence | — | 0.4434 | 0.4372 | 0.4836 | 0.5388 | 0.4763 |
| Support Vector Machine (SVM) | Tf-idf | 2,878 (90% variance after PCA) | 0.6488 | 0.6509 | 0.8299 | 0.7309 | 0.8549 |
| Ensemble Method (EM) | Word embedding | 30 (60% variance after PCA) | 0.6178 | 0.6203 | 0.8044 | 0.7404 | 0.8350 |
| Logistic Regression (LR) | Tf-idf | 50,000 | 0.6811 | 0.6822 | 0.8638 | 0.7695 | 0.8955 |
| Logistic Regression (LR) | Word embedding | 30 (60% variance after PCA) | 0.6347 | 0.6315 | 0.8130 | 0.7378 | 0.8149 |
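The per-class ROC-AUC columns can be computed by binarizing the labels one-vs-rest and scoring each class against the predicted probabilities. The values below use toy predictions, not the real test set:

```python
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score

classes = ["negative", "neutral", "positive"]
y_true = ["negative", "neutral", "positive", "positive", "negative", "neutral"]
# Toy predicted class probabilities, columns in the same order as `classes`.
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
])

# One column of 0/1 indicators per class (one-vs-rest).
y_bin = label_binarize(y_true, classes=classes)
for i, c in enumerate(classes):
    print(c, roc_auc_score(y_bin[:, i], y_prob[:, i]))
```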
In the dimensionality reduction section, we do not implement a specific model. Instead, we introduce a reduced input feature, which is then used by the other models and compared with the original one.
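A minimal sketch of this reduction: scikit-learn's `PCA` accepts a float `n_components`, keeping the smallest number of components that explains at least that fraction of variance (the random matrix below stands in for a dense feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # stand-in for a dense feature matrix

# A float in (0, 1) asks PCA for the smallest number of components
# whose cumulative explained variance reaches that ratio.
pca = PCA(n_components=0.60)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```

Note that for a sparse Tf-idf matrix, a dense conversion (or `TruncatedSVD`, which accepts sparse input) is the usual route, since classic `PCA` centers the data.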
a. Input feature:

- In our project, we extract and use 2 different feature classes: Tf-idf and word embedding. For each model, we choose the suitable input feature approach, since large input features are not viable for some models due to hardware limitations. For example, EM and GA both involve creating many models, which would cost a lot of space if we used the large Tf-idf feature.
- For HMM, the input feature is a sequence of tokens, which differs from the others because of the model's behaviour: it works on the token sequence and determines the most likely sequence of states that generated it.
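The word-embedding feature mentioned above can be obtained by averaging per-token vectors into one fixed-size document vector. The 4-dimensional toy lookup table below is an assumption standing in for the 300-dimensional embeddings in the results table:

```python
import numpy as np

# Toy embedding lookup table (assumption); real embeddings are 300-dim.
embeddings = {
    "good": np.array([0.9, 0.1, 0.0, 0.2]),
    "bad":  np.array([-0.8, 0.2, 0.1, 0.0]),
    "day":  np.array([0.1, 0.5, 0.3, 0.1]),
}

def embed(tokens):
    """Average the vectors of known tokens into one document vector."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    # Fall back to a zero vector when no token is in the vocabulary.
    return np.mean(vecs, axis=0) if vecs else np.zeros(4)

print(embed(["good", "day"]))
```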
b. Size-reduction trade-off: in NB and LR, we implement the feature extraction both with and without PCA reduction and analyze the results. For NB, surprisingly, we observe 2 behaviors: Tf-idf is more effective without PCA, while word embedding gives better results with PCA applied. In LR, the full Tf-idf (50,000 features) and the reduced word embedding (30 features) are implemented. The result is reasonable: the full Tf-idf performs slightly better than the reduced word embedding.
c. Hyperparameter tuning: during implementation, we set aside part of the data for model validation and hyperparameter tuning. For example, we tune the learning rate and number of hidden nodes for the MLP, and the number of estimators and maximum tree depth for the EM. Tuning shows a non-trivial improvement for some models. For others (HMM, GA), hardware limitations prevented proper tuning, leading to low results.
d. Result comparison:

- Among these models, unsurprisingly, the MLP with 1 hidden layer and the 50,000-dimensional Tf-idf input gives the best result, since it has the largest number of parameters. However, as we describe in the corresponding notebook, this model is not stable: it depends strongly on the initialization point and easily overfits the dataset.
- LR and DT also perform well compared with the others. These are the models for which we can utilize the full Tf-idf feature and tune properly, hence the better results. Also, compared to the MLP, these models suffer less from overfitting since they have fewer parameters.
- On the other hand, SVM and EM, which use the reduced input, give acceptable results. When implementing them, we observed that using the full input feature is costly in both time and space; hence the PCA-reduced feature is applied, which yields a reasonable result.
- Finally, GA and HMM have the lowest performance. GA has the lowest F1-score since it predicts only one label for every sample; it also shows an extremely weak ability to distinguish between the classes, as suggested by its AUC-ROC. Besides its poor performance, it is space-expensive, since it involves creating many neural network models. For HMM, we suffer from data scarcity: this model requires splitting the data and building a separate HMM for each class, so each HMM is likely underfitted, leading to a poor result.
