This project performs multilabel text classification on sentences written in the Betawi dialect, which is spoken in Jakarta, Indonesia. Each sentence carries one sentiment label (positive, negative, or neutral) for each of six categories: fuel, machine, others, part, price, and service. The project covers data preprocessing, exploratory data analysis (EDA), and the application of machine learning models (Naive Bayes, SVM, and KNN) to classify the sentiment of each category.
Objectives:
- Preprocess and clean the dataset to ensure data quality and consistency.
- Conduct EDA to understand the distribution of sentiments across categories.
- Develop and evaluate multilabel classification models to predict sentiments for each category based on textual input.
- Support sentiment analysis for Betawi text, enabling insights into public opinions about vehicles and related services.
The dataset is split into three CSV files: train_preprocess.csv, valid_preprocess.csv, and test_preprocess.csv, with the following details:
- Train Data: 810 rows, 7 columns
- Validation Data: 90 rows, 7 columns
- Test Data: 180 rows, 7 columns
Columns:
- sentence: Text input in Betawi dialect (e.g., "Kenapa sih Avanza jadi boros bensin gini?").
- fuel, machine, others, part, price, service: Sentiment labels (positive, negative, neutral) for each category.
Data Issues:
- No missing values were found in the train, validation, or test datasets.
- The dataset is relatively small, which may impact model generalization.
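The checks described above can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the helper names `check_split` and `sentiment_counts` are hypothetical, and the small `demo` frame uses invented labels (only the first sentence comes from the column description above). For the real files, `pd.read_csv("train_preprocess.csv")` would presumably take the place of `demo`.

```python
import pandas as pd

LABEL_COLUMNS = ["fuel", "machine", "others", "part", "price", "service"]
SENTIMENTS = ["positive", "negative", "neutral"]


def check_split(df: pd.DataFrame) -> dict:
    """Summarise one split: shape, missing columns/values, unexpected labels."""
    expected = ["sentence"] + LABEL_COLUMNS
    present = [c for c in LABEL_COLUMNS if c in df.columns]
    # Any label outside the three sentiments (including NaN) counts as unexpected.
    bad_labels = int((~df[present].isin(SENTIMENTS)).sum().sum())
    return {
        "shape": df.shape,
        "missing_columns": [c for c in expected if c not in df.columns],
        "missing_values": int(df.isna().sum().sum()),
        "unexpected_labels": bad_labels,
    }


def sentiment_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count each sentiment per category (rows: sentiment, columns: category)."""
    counts = df[LABEL_COLUMNS].apply(pd.Series.value_counts)
    return counts.reindex(SENTIMENTS).fillna(0).astype(int)


# Invented demo rows; the label values are assumptions for illustration only.
demo = pd.DataFrame({
    "sentence": [
        "Kenapa sih Avanza jadi boros bensin gini?",
        "placeholder sentence 2",
        "placeholder sentence 3",
    ],
    "fuel": ["negative", "neutral", "neutral"],
    "machine": ["neutral", "positive", "neutral"],
    "others": ["neutral", "neutral", "neutral"],
    "part": ["neutral", "neutral", "negative"],
    "price": ["neutral", "neutral", "positive"],
    "service": ["neutral", "neutral", "neutral"],
})

summary = check_split(demo)
counts = sentiment_counts(demo)
print(summary)
print(counts)
```

Running the same two helpers on each of the three splits would give a quick view of how the sentiments are distributed per category before any model is trained.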
The analysis is conducted in a Jupyter Notebook: betawi-text-multilabel-classification.ipynb. The notebook is structured as follows:
- Library Imports: Imports libraries including pandas, numpy, matplotlib, seaborn, scikit-learn, skmultilearn, altair, plotly, and missingno.
- Data Loading: Loads the train, validation, and test datasets and provides a summary of dataset sizes and features.
- Data Cleansing: Verifies data types and checks for missing values, confirming no missing data.
- Data Visualization: Visualizes sentiment distributions and other patterns in the data.
- Model Training and Evaluation: Trains multilabel classification models (Naive Bayes, SVM, KNN) using techniques like Binary Relevance and Classifier Chain, and evaluates their performance.
- Model Evaluation: Generates confusion matrices and classification reports for each model and saves predictions to hasil_prediksi_semua_model.csv.
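The training step can be sketched as follows. This is not the notebook's code. It assumes TF-IDF features (mentioned in the results below) and swaps in scikit-learn's `MultiOutputClassifier`, which fits one independent classifier per category, as a stand-in for the Binary Relevance idea; the notebook itself lists skmultilearn. `LinearSVC` stands for the SVM model here, and the toy sentences, the integer encoding, and the `make_targets` helper are all invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

LABEL_COLUMNS = ["fuel", "machine", "others", "part", "price", "service"]
SENTIMENT_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}  # assumed encoding

# Invented toy rows (NOT from the dataset): sentence, category it concerns, sentiment.
TOY_ROWS = [
    ("bensin boros banget", "fuel", "negative"),
    ("bensin irit banget", "fuel", "positive"),
    ("mesin kasar", "machine", "negative"),
    ("mesin halus", "machine", "positive"),
    ("desain jelek", "others", "negative"),
    ("desain bagus", "others", "positive"),
    ("onderdil susah dicari", "part", "negative"),
    ("onderdil gampang dicari", "part", "positive"),
    ("harga mahal", "price", "negative"),
    ("harga murah", "price", "positive"),
    ("servis lambat", "service", "negative"),
    ("servis cepat", "service", "positive"),
]


def make_targets(category, sentiment):
    """Return six integer labels: neutral everywhere except the given category."""
    row = [SENTIMENT_TO_ID["neutral"]] * len(LABEL_COLUMNS)
    row[LABEL_COLUMNS.index(category)] = SENTIMENT_TO_ID[sentiment]
    return row


sentences = [text for text, _, _ in TOY_ROWS]
targets = np.array([make_targets(cat, sent) for _, cat, sent in TOY_ROWS])

# One TF-IDF vectorizer shared by six independent classifiers, one per category.
model = make_pipeline(TfidfVectorizer(), MultiOutputClassifier(LinearSVC()))
model.fit(sentences, targets)

predictions = model.predict(["bensin boros", "servis cepat"])
print(predictions.shape)  # one row per sentence, one column per category
```

Replacing `LinearSVC()` with `MultinomialNB()` or `KNeighborsClassifier()` from scikit-learn would give the analogous Naive Bayes and KNN variants. A Classifier Chain differs in that each classifier also sees the predictions for earlier categories, which this sketch does not attempt.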
The models were evaluated using the following metrics:
- Accuracy: Measures the overall correctness of predictions across all labels.
- Classification Report: Provides precision, recall, and F1-score for each label category.
- Confusion Matrix: Visualizes the number of correct and incorrect predictions for each sentiment category (positive, negative, neutral) across the six labels.
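With six labels per sentence, "accuracy" can be read in more than one way, so a small sketch may help. The code below is an illustration in plain Python, not the notebook's evaluation code: it contrasts per-category accuracy with exact-match accuracy (all six categories correct at once) and builds a confusion matrix for one category. The function names and the toy labels are invented. In practice, scikit-learn's `confusion_matrix` and `classification_report` cover the same ground.

```python
from collections import Counter

LABEL_COLUMNS = ["fuel", "machine", "others", "part", "price", "service"]
SENTIMENTS = ["negative", "neutral", "positive"]


def per_category_accuracy(y_true, y_pred):
    """Fraction of sentences predicted correctly, computed separately per category."""
    n = len(y_true)
    return {
        name: sum(t[i] == p[i] for t, p in zip(y_true, y_pred)) / n
        for i, name in enumerate(LABEL_COLUMNS)
    }


def exact_match_accuracy(y_true, y_pred):
    """Fraction of sentences where all six categories are predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred, category):
    """Confusion matrix for one category: rows are true, columns are predicted."""
    i = LABEL_COLUMNS.index(category)
    pairs = Counter((t[i], p[i]) for t, p in zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in SENTIMENTS] for t in SENTIMENTS]


# Invented toy labels for four sentences, in LABEL_COLUMNS order.
y_true = [
    ["negative", "neutral", "neutral", "neutral", "neutral", "neutral"],
    ["neutral", "positive", "neutral", "neutral", "neutral", "neutral"],
    ["neutral", "neutral", "neutral", "neutral", "negative", "neutral"],
    ["neutral", "neutral", "neutral", "neutral", "neutral", "positive"],
]
y_pred = [
    ["negative", "neutral", "neutral", "neutral", "neutral", "neutral"],
    ["neutral", "neutral", "neutral", "neutral", "neutral", "neutral"],
    ["neutral", "neutral", "neutral", "neutral", "negative", "neutral"],
    ["negative", "neutral", "neutral", "neutral", "neutral", "positive"],
]

per_category = per_category_accuracy(y_true, y_pred)
exact_match = exact_match_accuracy(y_true, y_pred)
fuel_matrix = confusion_counts(y_true, y_pred, "fuel")

print(per_category)  # fuel and machine: 0.75, the other four: 1.0
print(exact_match)   # 0.5
print(fuel_matrix)   # [[1, 0, 0], [1, 2, 0], [0, 0, 0]]
```

In this toy case every category scores at least 0.75 on its own, yet only half of the sentences are fully correct, which is why it is worth stating which notion of accuracy a reported figure uses.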
The confusion matrices and classification reports reveal the following:
- Naive Bayes: Performs well for text-based sentiment classification due to its simplicity and effectiveness with TF-IDF features.
- SVM: Shows the strongest performance, particularly for categories with distinct sentiment patterns, but may struggle where sentiments overlap.
- KNN: Has moderate performance, likely due to sensitivity to the small dataset size and high-dimensional TF-IDF features.
These results suggest that Naive Bayes and SVM are more reliable than KNN for multilabel sentiment classification on this dataset, and they provide a foundation for analyzing sentiment in Betawi text.





