NLP-Sentiment-Analysis: US-Airline Tweets

This repository contains a comparative study of various machine learning classifiers for natural language processing (NLP) sentiment analysis. The project evaluates model performance on a dataset of user opinions about US Airlines from X (formerly Twitter).

Project Overview

Objective: Classify the polarity of airline tweets using text classification techniques.

Dataset: Twitter_US_Airline_Sentiment.csv (included in this repository)

Feature Extraction: TF-IDF vectorization tested under three vocabulary constraints: a minimum document frequency of 5 (Experiment 1), a 2,500-word vocabulary cap (Experiment 2), and a 500-word cap (Experiment 3).
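Assuming the project uses scikit-learn (as is typical for this kind of study), these constraints map onto TfidfVectorizer's min_df and max_features parameters. A minimal sketch on a hypothetical toy corpus, with min_df scaled down to 2 so the tiny example produces a non-empty vocabulary (the project itself uses min_df=5 on the full tweet dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative tweets; the real data comes from Twitter_US_Airline_Sentiment.csv.
tweets = [
    "flight delayed again terrible service",
    "great flight friendly crew",
    "lost my bag flight was late",
    "best flight ever thank you",
    "flight cancelled no refund terrible",
]

# Keep only terms that appear in at least 2 documents
# (scaled-down analogue of the project's min_df=5).
vec = TfidfVectorizer(min_df=2)
X = vec.fit_transform(tweets)

print(X.shape)                    # (n_tweets, vocabulary size) -> (5, 2) here
print(sorted(vec.vocabulary_))    # only terms surviving the min_df cutoff
```

Raising min_df (or lowering max_features) shrinks the feature matrix, which is exactly the cost/performance trade-off the experiments below explore.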

Models Evaluated:

  • Logistic Regression
  • Support Vector Machines (LinearSVC)
  • Random Forests
  • Feed-forward Neural Network

The classifiers are evaluated using 5-fold cross-validation. Performance is measured across three primary metrics:

  • Accuracy
  • F1-score
  • Fit time
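All three metrics fall out of a single call to scikit-learn's cross_validate, which reports per-fold fit times alongside test scores. A sketch of the evaluation loop, assuming a TF-IDF + classifier pipeline; the mini-corpus, classifier settings, and the choice of macro-averaged F1 here are illustrative, not taken from the notebook:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate

# Hypothetical mini-corpus standing in for the airline tweets.
texts = (["love this airline great crew awesome flight"] * 10
         + ["worst airline ever delayed flight terrible service"] * 10)
labels = [1] * 10 + [0] * 10

for name, clf in [("LogisticRegression", LogisticRegression(max_iter=1000)),
                  ("LinearSVC", LinearSVC())]:
    model = make_pipeline(TfidfVectorizer(), clf)
    # cv=5 reproduces the 5-fold cross-validation used in the project.
    scores = cross_validate(model, texts, labels, cv=5,
                            scoring=["accuracy", "f1_macro"])
    print(f"{name}: acc={scores['test_accuracy'].mean():.3f} "
          f"f1={scores['test_f1_macro'].mean():.3f} "
          f"fit_time={scores['fit_time'].mean():.4f}s")
```

On this trivially separable toy data every classifier scores perfectly; the fit-time column is where the models in the study differ most.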

Results

  • Logistic Regression consistently outperformed the other models. It achieved its best results in Experiment 1, reaching 78.37% accuracy and a 76.95% F1-score, while remaining highly computationally efficient.
  • Support Vector Machines (LinearSVC) delivered very competitive accuracy and F1-scores, and stood out with the fastest fit times across all experiments.
  • Random Forest and the Feed-Forward Neural Network yielded significantly lower accuracy and F1-scores, alongside much slower training times. The Neural Network, due to its complexity, was the least effective and least efficient model for this setup.

The Impact of Vocabulary Size (TF-IDF)

The experiments revealed a clear trade-off between computational cost and classification performance based on vocabulary constraints:

  • Experiment 1 (min_df=5): Highest accuracy and F1-scores. Best choice for maximizing predictive performance.
  • Experiment 2 (max_features=2500): Offers the best balance of solid performance and faster training times.
  • Experiment 3 (max_features=500): Fastest to train, but suffers a significant drop in accuracy and F1-score due to the restricted vocabulary.
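The trade-off can be demonstrated in miniature by fitting TfidfVectorizer under progressively tighter constraints; the caps below are scaled down from the project's 2500/500 so a four-document toy corpus shows the effect:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; the real experiments run on the full tweet dataset.
corpus = [
    "the flight was late and the crew was rude",
    "great service and a smooth flight",
    "bag lost again customer service unhelpful",
    "smooth boarding friendly crew great flight",
]

sizes = []
for params in [{"min_df": 1},          # analogue of Experiment 1 (min_df=5)
               {"max_features": 5},    # analogue of Experiment 2 (max_features=2500)
               {"max_features": 2}]:   # analogue of Experiment 3 (max_features=500)
    vec = TfidfVectorizer(**params)
    vec.fit(corpus)
    sizes.append(len(vec.vocabulary_))
    print(params, "->", sizes[-1], "terms")
```

Each tighter cap shrinks the vocabulary, and with it both training cost and the information available to the classifier.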

Note: See the accompanying Jupyter Notebook for the full code, metrics, and exploratory data analysis (EDA).
