angelalim88/Betawi-Multi-Label-Text-Classification

Betawi Multilabel Text Classification

Overview

This project performs multilabel text classification on sentences written in the Betawi dialect, which is spoken in Jakarta, Indonesia. Each sentence carries a sentiment label (positive, negative, or neutral) for each of six categories: fuel, machine, others, part, price, and service. The work covers data preprocessing, exploratory data analysis (EDA), and three machine learning models (Naive Bayes, SVM, and KNN) that predict the sentiment for every category.

Objectives:

  • Preprocess and clean the dataset to ensure data quality and consistency.
  • Conduct EDA to understand the distribution of sentiments across categories.
  • Develop and evaluate multilabel classification models to predict sentiments for each category based on textual input.
  • Support sentiment analysis for Betawi text, enabling insights into public opinions about vehicles and related services.

Dataset

The dataset is split into three CSV files: train_preprocess.csv, valid_preprocess.csv, and test_preprocess.csv, with the following details:

  • Train Data: 810 rows, 7 columns
  • Validation Data: 90 rows, 7 columns
  • Test Data: 180 rows, 7 columns

Columns:

  • sentence: Text input in Betawi dialect (e.g., "Kenapa sih Avanza jadi boros bensin gini?").
  • fuel, machine, others, part, price, service: Sentiment labels (positive, negative, neutral) for each category.
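
As a minimal sketch of how one split could be loaded and checked against the schema above, the snippet below reads a CSV with pandas and verifies the seven documented columns. The helper name `load_split` is hypothetical, the exact label spellings in the real files are assumed to match the README (positive, negative, neutral), and the in-memory row is a toy stand-in whose label values are invented for illustration.

```python
import io

import pandas as pd

LABEL_COLUMNS = ["fuel", "machine", "others", "part", "price", "service"]
# Assumption: the CSVs spell the sentiments exactly as the README does.
SENTIMENTS = {"positive", "negative", "neutral"}


def load_split(source):
    """Read one split and check that it has the documented 7-column schema."""
    df = pd.read_csv(source)
    expected = ["sentence"] + LABEL_COLUMNS
    missing = [col for col in expected if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for col in LABEL_COLUMNS:
        unexpected = set(df[col].unique()) - SENTIMENTS
        if unexpected:
            raise ValueError(f"unexpected labels in {col!r}: {unexpected}")
    return df[expected]


# Toy stand-in for train_preprocess.csv; the labels on this row are invented.
sample_csv = io.StringIO(
    "sentence,fuel,machine,others,part,price,service\n"
    '"Kenapa sih Avanza jadi boros bensin gini?",'
    "negative,neutral,neutral,neutral,neutral,neutral\n"
)
sample = load_split(sample_csv)
print(sample.shape)
```

With the real files, the same helper would be called as `load_split("train_preprocess.csv")`, assuming the file sits in the working directory.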

Data Issues:

  • No missing values were found in the train, validation, or test datasets.
  • The dataset is relatively small (1,080 sentences across the three splits), which may limit model generalization.
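
The checks described above could be reproduced roughly as follows. This is a sketch under assumptions, not the notebook's code: the function name `summarize` is hypothetical and the three-row frame is invented toy data that only mimics the documented column layout.

```python
import pandas as pd

LABEL_COLUMNS = ["fuel", "machine", "others", "part", "price", "service"]


def summarize(df):
    """Return the number of missing cells and a category-by-sentiment count table."""
    n_missing = int(df.isna().sum().sum())
    counts = (
        df[LABEL_COLUMNS]
        .apply(lambda col: col.value_counts())  # one count column per category
        .fillna(0)                              # sentiments absent from a category
        .astype(int)
        .T                                      # rows: categories, columns: sentiments
    )
    return n_missing, counts


# Invented toy rows; real data would come from the three CSV files.
toy = pd.DataFrame(
    {
        "sentence": ["a", "b", "c"],
        "fuel": ["negative", "neutral", "neutral"],
        "machine": ["neutral", "neutral", "neutral"],
        "others": ["neutral", "neutral", "neutral"],
        "part": ["neutral", "neutral", "neutral"],
        "price": ["neutral", "neutral", "neutral"],
        "service": ["neutral", "neutral", "neutral"],
    }
)
n_missing, counts = summarize(toy)
print(n_missing)
print(counts)
```

A table like `counts` also makes class imbalance visible per category, which matters on a dataset this small.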

Project Structure

The analysis is conducted in a Jupyter Notebook: betawi-text-multilabel-classification.ipynb. The notebook is structured as follows:

  1. Library Imports: Imports libraries including pandas, numpy, matplotlib, seaborn, scikit-learn, skmultilearn, altair, plotly, and missingno.
  2. Data Loading: Loads the train, validation, and test datasets and provides a summary of dataset sizes and features.
  3. Data Cleansing: Verifies data types and checks for missing values, confirming no missing data.
  4. Data Visualization: Visualizes sentiment distributions and other patterns in the data.
  5. Model Training and Evaluation: Trains multilabel classification models (Naive Bayes, SVM, KNN) using techniques like Binary Relevance and Classifier Chain, and evaluates their performance.
  6. Model Evaluation: Generates confusion matrices and classification reports for each model and saves predictions to hasil_prediksi_semua_model.csv.
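
The training step can be sketched as below. Note the substitution: the notebook is described as using Binary Relevance and Classifier Chain (from skmultilearn), whereas this sketch uses scikit-learn's `MultiOutputClassifier`, which fits one independent classifier per category and is similar in spirit to Binary Relevance while accepting three-class targets. The toy sentences, their labels, the integer encoding, the choice of `LinearSVC` for the SVM, and `n_neighbors=3` are all assumptions made for illustration; only two of the six categories are shown to keep the example short.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Assumed integer encoding of the three sentiments.
ENCODE = {"negative": 0, "neutral": 1, "positive": 2}

# Invented toy corpus; real input would be the `sentence` column.
sentences = [
    "bensin boros banget",
    "bensin irit",
    "harga mahal",
    "harga murah",
    "bensin boros harga mahal",
    "bensin irit harga murah",
]
fuel = ["negative", "positive", "neutral", "neutral", "negative", "positive"]
price = ["neutral", "neutral", "negative", "positive", "negative", "positive"]

# Target matrix: one row per sentence, one column per category.
Y = np.array([[ENCODE[f], ENCODE[p]] for f, p in zip(fuel, price)])


def build_model(estimator):
    """TF-IDF features followed by one copy of `estimator` per category."""
    return Pipeline(
        [
            ("tfidf", TfidfVectorizer()),
            ("clf", MultiOutputClassifier(estimator)),
        ]
    )


models = {
    "naive_bayes": build_model(MultinomialNB()),
    "svm": build_model(LinearSVC()),
    "knn": build_model(KNeighborsClassifier(n_neighbors=3)),
}

predictions = {}
for name, model in models.items():
    model.fit(sentences, Y)
    # Predicting on the training sentences only demonstrates the output shape.
    predictions[name] = model.predict(sentences)
    print(name, predictions[name].shape)
```

Putting the vectorizer inside the pipeline keeps the TF-IDF vocabulary fitted on training data only, so the validation and test splits can be passed to `predict` as raw text.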

Model Evaluation

The models were evaluated using the following metrics:

  • Accuracy: Measures the overall correctness of predictions across all labels.
  • Classification Report: Provides precision, recall, and F1-score for each label category.
  • Confusion Matrix: Visualizes the number of correct and incorrect predictions for each sentiment category (positive, negative, neutral) across the six labels.
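
"Accuracy across all labels" can be read in more than one way. The sketch below illustrates two common readings, per-category accuracy and exact-match accuracy (every category correct for a sentence); which one the notebook reports is not stated here, so treat both function names and the toy labels as hypothetical.

```python
def per_label_accuracy(y_true, y_pred, labels):
    """Fraction of sentences predicted correctly, computed separately per category."""
    n = len(y_true)
    return {
        label: sum(t[i] == p[i] for t, p in zip(y_true, y_pred)) / n
        for i, label in enumerate(labels)
    }


def exact_match_accuracy(y_true, y_pred):
    """Fraction of sentences where every category is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented toy labels for two categories (fuel, price).
labels = ["fuel", "price"]
y_true = [
    ("negative", "neutral"),
    ("neutral", "positive"),
    ("positive", "neutral"),
    ("neutral", "neutral"),
]
y_pred = [
    ("negative", "neutral"),
    ("neutral", "neutral"),
    ("positive", "neutral"),
    ("negative", "neutral"),
]

per_label = per_label_accuracy(y_true, y_pred, labels)
exact = exact_match_accuracy(y_true, y_pred)
print(per_label)  # {'fuel': 0.75, 'price': 0.75}
print(exact)      # 0.5
```

Exact-match accuracy is never higher than the lowest per-category accuracy, so the two numbers should not be compared directly.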

Figures: confusion matrices for Naive Bayes (NB), SVM, and KNN.

Evaluation Results

The confusion matrices and classification reports reveal the following:

  • Naive Bayes: Performs well for text-based sentiment classification due to its simplicity and effectiveness with TF-IDF features.
  • SVM: Shows the strongest performance, particularly for categories with distinct sentiment patterns, but may struggle when sentiments overlap.
  • KNN: Has moderate performance, likely due to sensitivity to the small dataset size and high-dimensional TF-IDF features.

These results suggest that Naive Bayes and SVM are more reliable for multilabel sentiment classification in this dataset, providing a foundation for analyzing Betawi text sentiments.

Others

  • Label Distribution (figure)
  • Bag of Words (figure)
  • Bag of N-grams (figure)
