Betawi Multilabel Text Classification

Overview

This project performs multilabel text classification on sentences written in the Betawi dialect, which is spoken in and around Jakarta, Indonesia. Each sentence carries a sentiment label for each of six categories: fuel, machine, others, part, price, and service. Each label takes one of three values: positive, negative, or neutral. The project covers data preprocessing, exploratory data analysis (EDA), and the training of machine learning models (Naive Bayes, SVM, and KNN) that predict the sentiment for every category.

Objectives:

Preprocess and clean the dataset to ensure data quality and consistency.
Conduct EDA to understand the distribution of sentiments across categories.
Develop and evaluate multilabel classification models to predict sentiments for each category based on textual input.
Support sentiment analysis for Betawi text, enabling insights into public opinions about vehicles and related services.

Dataset

The dataset is split into three CSV files: train_preprocess.csv, valid_preprocess.csv, and test_preprocess.csv, with the following details:

Train Data: 810 rows, 7 columns
Validation Data: 90 rows, 7 columns
Test Data: 180 rows, 7 columns

Columns:

sentence: Text input in Betawi dialect (e.g., "Kenapa sih Avanza jadi boros bensin gini?", roughly "Why has the Avanza become such a fuel guzzler?").
fuel, machine, others, part, price, service: Sentiment labels (positive, negative, neutral) for each category.

Data Issues:

No missing values were found in the train, validation, or test datasets.
The dataset is relatively small, which may impact model generalization.
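
The shape and missing-value checks described above can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the two rows below are made-up placeholders standing in for train_preprocess.csv, and the variable names are assumptions.

```python
import io

import pandas as pd

LABEL_COLUMNS = ["fuel", "machine", "others", "part", "price", "service"]

# Two made-up rows standing in for train_preprocess.csv. In practice,
# pass the path of the real CSV file to pd.read_csv instead of this buffer.
sample_csv = io.StringIO(
    "sentence,fuel,machine,others,part,price,service\n"
    "placeholder sentence A,negative,neutral,neutral,neutral,neutral,neutral\n"
    "placeholder sentence B,neutral,neutral,positive,neutral,neutral,neutral\n"
)
train_df = pd.read_csv(sample_csv)

# Dataset size: the real training split is reported as 810 rows, 7 columns.
n_rows, n_cols = train_df.shape
print(f"rows={n_rows}, columns={n_cols}")

# Missing values per column; every entry should be 0 for this dataset.
missing_per_column = train_df.isna().sum()
print(missing_per_column)

# Sentiment counts per category, the basis for a label-distribution plot.
label_counts = {col: train_df[col].value_counts().to_dict() for col in LABEL_COLUMNS}
print(label_counts)
```

The same three checks can be repeated for valid_preprocess.csv and test_preprocess.csv.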

Project Structure

The analysis is conducted in a Jupyter Notebook: betawi-text-multilabel-classification.ipynb. The notebook is structured as follows:

Library Imports: Imports libraries including pandas, numpy, matplotlib, seaborn, scikit-learn, skmultilearn, altair, plotly, and missingno.
Data Loading: Loads the train, validation, and test datasets and provides a summary of dataset sizes and features.
Data Cleansing: Verifies data types and checks for missing values, confirming no missing data.
Data Visualization: Visualizes sentiment distributions and other patterns in the data.
Model Training and Evaluation: Trains multilabel classification models (Naive Bayes, SVM, KNN) using techniques like Binary Relevance and Classifier Chain, and evaluates their performance.
Model Evaluation: Generates confusion matrices and classification reports for each model and saves predictions to hasil_prediksi_semua_model.csv.
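
The training step can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's implementation: it uses scikit-learn's MultiOutputClassifier, which fits one independent classifier per category (the same decomposition idea as Binary Relevance), instead of skmultilearn. The sentences and labels are made up, and only two of the six categories are shown to keep the example short.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

CATEGORIES = ["fuel", "price"]  # subset of the six categories, for brevity

# Made-up training sentences and labels (not taken from the dataset).
texts = [
    "bensin boros banget",
    "bensin irit banget",
    "harga mahal banget",
    "harga murah banget",
    "bensin boros harga mahal",
    "bensin irit harga murah",
]
labels = np.array([
    ["negative", "neutral"],
    ["positive", "neutral"],
    ["neutral", "negative"],
    ["neutral", "positive"],
    ["negative", "negative"],
    ["positive", "positive"],
])

# TF-IDF features feed one Naive Bayes classifier per category column.
model = make_pipeline(TfidfVectorizer(), MultiOutputClassifier(MultinomialNB()))
model.fit(texts, labels)

# Predict a sentiment for every category of each new sentence.
new_texts = ["bensin boros", "harga murah"]
predictions = model.predict(new_texts)
for text, row in zip(new_texts, predictions):
    print(text, dict(zip(CATEGORIES, row)))
```

Replacing MultinomialNB with another scikit-learn classifier (for example an SVM or KNN estimator) keeps the rest of the pipeline unchanged, which makes it straightforward to compare the three model families on the same features.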

Model Evaluation

The models were evaluated using the following metrics:

Accuracy: Measures the overall correctness of predictions across all labels.
Classification Report: Provides precision, recall, and F1-score for each label category.
Confusion Matrix: Visualizes the number of correct and incorrect predictions for each sentiment category (positive, negative, neutral) across the six labels.
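
With several labels per sentence, "accuracy" can be computed per category or as an exact match over all categories at once. The README does not say which variant the notebook reports, so the sketch below shows both, along with confusion-matrix counts for one category. The labels and predictions are made up for illustration, and only two categories are used.

```python
from collections import Counter

CATEGORIES = ["fuel", "price"]  # subset of the six categories, for brevity

# Made-up true labels and predictions, one tuple per sentence.
y_true = [
    ("negative", "neutral"),
    ("positive", "neutral"),
    ("neutral", "negative"),
    ("neutral", "positive"),
]
y_pred = [
    ("negative", "neutral"),
    ("neutral", "neutral"),
    ("neutral", "negative"),
    ("neutral", "neutral"),
]

# Per-category accuracy: fraction of sentences whose label for that
# category was predicted correctly.
per_category_accuracy = {}
for i, category in enumerate(CATEGORIES):
    correct = sum(t[i] == p[i] for t, p in zip(y_true, y_pred))
    per_category_accuracy[category] = correct / len(y_true)

# Exact-match (subset) accuracy: a sentence counts only if every
# category is predicted correctly.
exact_match_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion-matrix counts for the fuel category, keyed by (true, predicted).
fuel_confusion = Counter((t[0], p[0]) for t, p in zip(y_true, y_pred))

print(per_category_accuracy)   # {'fuel': 0.75, 'price': 0.75}
print(exact_match_accuracy)    # 0.5
print(dict(fuel_confusion))
```

Exact-match accuracy is never higher than the lowest per-category accuracy, so the two numbers should not be compared directly.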

Evaluation Results

The confusion matrices and classification reports reveal the following:

Naive Bayes: Performs well for text-based sentiment classification due to its simplicity and effectiveness with TF-IDF features.
SVM: Shows the strongest performance, particularly for categories with distinct sentiment patterns, but may struggle where sentiments overlap.
KNN: Has moderate performance, likely due to sensitivity to the small dataset size and high-dimensional TF-IDF features.

These results suggest that Naive Bayes and SVM are more reliable than KNN for multilabel sentiment classification on this dataset, providing a foundation for analyzing sentiment in Betawi text.

Others

Label Distribution: [figure]
Bag of Words: [figure]
Bag of N-grams: [figure]
