This repository contains a comparative study of machine learning classifiers for sentiment analysis, a natural language processing (NLP) task. The project evaluates model performance on a dataset of user opinions about US airlines posted on X (formerly Twitter).
Objective: Classify the polarity of airline tweets using text classification techniques.
Dataset: Twitter_US_Airline_Sentiment.csv (attached to this repository)
Feature Extraction: TF-IDF vectorization tested under three vocabulary constraints: a minimum document frequency of 5 (min_df=5), a maximum of 2,500 features (max_features=2500), and a maximum of 500 features (max_features=500).
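As a rough illustration of the three vocabulary constraints, the sketch below builds one scikit-learn `TfidfVectorizer` per experiment. It assumes scikit-learn is the library in use; the names `VECTORIZER_CONFIGS`, `vocabulary_sizes`, and `toy_tweets` are hypothetical and exist only for this example. All other vectorizer settings are left at their defaults, which may differ from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One entry per experiment; only the vocabulary constraint differs.
# (Hypothetical names; other TfidfVectorizer settings stay at defaults.)
VECTORIZER_CONFIGS = {
    "experiment_1": {"min_df": 5},           # keep terms found in >= 5 documents
    "experiment_2": {"max_features": 2500},  # keep the 2500 most frequent terms
    "experiment_3": {"max_features": 500},   # keep the 500 most frequent terms
}


def vocabulary_sizes(texts, configs=VECTORIZER_CONFIGS):
    """Fit one TF-IDF vectorizer per config and report its vocabulary size."""
    sizes = {}
    for name, params in configs.items():
        vectorizer = TfidfVectorizer(**params)
        vectorizer.fit(texts)
        sizes[name] = len(vectorizer.vocabulary_)
    return sizes


# Toy corpus (not the real dataset): "flight" occurs in all six documents,
# so it is the only term that survives min_df=5.
toy_tweets = [
    "flight delayed again",
    "great flight and friendly crew",
    "flight cancelled no refund",
    "lost my bag after the flight",
    "smooth flight thanks",
    "worst flight ever",
]

if __name__ == "__main__":
    print(vocabulary_sizes(toy_tweets))
```

On a corpus this small the two `max_features` caps never bind, so Experiments 2 and 3 yield the same vocabulary; the difference only shows up on the full dataset.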
Models Evaluated:
- Logistic Regression
- Support Vector Machines (LinearSVC)
- Random Forests
- Feed-forward Neural Network
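The four model families could be instantiated along these lines. This is a sketch under assumptions: it uses scikit-learn estimators throughout, and in particular it stands in `MLPClassifier` for the feed-forward network, which the notebook may implement with a different library. The helper name `build_models` and every hyperparameter shown (`max_iter`, `random_state`) are illustrative choices, not the study's settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC


def build_models(random_state=0):
    """Return one untrained estimator per model family (hypothetical helper).

    Hyperparameters are illustrative defaults, not the values used in the study.
    """
    return {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Support Vector Machines (LinearSVC)": LinearSVC(random_state=random_state),
        "Random Forests": RandomForestClassifier(random_state=random_state),
        # Assumption: a scikit-learn MLP stands in for the feed-forward network.
        "Feed-forward Neural Network": MLPClassifier(random_state=random_state),
    }


if __name__ == "__main__":
    for name, model in build_models().items():
        print(name, "->", type(model).__name__)
```

Keeping the estimators in a dictionary makes it straightforward to loop over them with the same vectorizer and evaluation routine.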
The classifiers are evaluated using 5-fold cross-validation. Performance is measured across three primary metrics:
- Accuracy
- F1-score
- Fit time
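One way to collect these three metrics is scikit-learn's `cross_validate`, which reports per-fold scores and fit times together. The sketch below is an assumption about how such an evaluation might look, not the notebook's code: the helper `evaluate`, the toy data, and the choice of weighted averaging for the F1-score (`f1_weighted`) are all illustrative, since the averaging method is not stated here. The vectorizer is placed inside a pipeline so that it is refit on the training portion of each fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline


def evaluate(model, texts, labels, n_splits=5):
    """Run k-fold cross-validation and average accuracy, F1, and fit time.

    Hypothetical helper; 'f1_weighted' is an assumed averaging choice.
    """
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores = cross_validate(
        pipeline,
        texts,
        labels,
        cv=n_splits,  # an integer gives stratified folds for classifiers
        scoring=("accuracy", "f1_weighted"),
    )
    return {
        "accuracy": float(scores["test_accuracy"].mean()),
        "f1": float(scores["test_f1_weighted"].mean()),
        "fit_time": float(scores["fit_time"].mean()),  # seconds per fold
    }


# Toy data (not the real dataset): five examples per class, enough for 5 folds.
toy_texts = [
    "great flight", "friendly crew", "smooth landing thanks",
    "great service", "friendly and great staff",
    "delayed flight", "lost my bag", "cancelled flight no refund",
    "rude staff", "delayed again and lost bag",
]
toy_labels = ["positive"] * 5 + ["negative"] * 5

if __name__ == "__main__":
    print(evaluate(LogisticRegression(max_iter=1000), toy_texts, toy_labels))
```

The scores on ten toy sentences are meaningless in themselves; the point is the shape of the result, which has one averaged value per metric.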
Results:
- Logistic Regression consistently outperformed the others. It achieved its highest performance in Experiment 1, reaching 78.37% Accuracy and a 76.95% F1-score, while remaining highly computationally efficient.
- Support Vector Machines (SVM) delivered very competitive accuracy and F1-scores, and stood out by having the fastest fit times across all experiments.
- Random Forest and the Feed-Forward Neural Network yielded significantly lower accuracy and F1-scores, alongside much slower training times. The Neural Network, due to its complexity, was the least effective and least efficient model for this specific setup.
The experiments revealed a clear trade-off between computational cost and classification performance based on vocabulary constraints:
- Experiment 1 (min_df=5): Highest accuracy and F1-scores. Best choice for maximizing predictive performance.
- Experiment 2 (max_features=2500): Offers the best balance of solid performance and faster training times.
- Experiment 3 (max_features=500): Fastest to train, but suffers a significant drop in accuracy and F1-score due to the restricted vocabulary.
Note: See the attached Jupyter Notebook for full code, metrics, and exploratory data analysis (EDA).
