Skip to content

Latest commit

 

History

History
56 lines (36 loc) · 1.75 KB

File metadata and controls

56 lines (36 loc) · 1.75 KB

Sentiment Analysis with NLP Models

This project applies Natural Language Processing (NLP) techniques to classify IMDB movie reviews as positive or negative.

It compares three different modeling approaches:

  • Naive Bayes with Bag-of-Words / TF-IDF
  • Deep Learning with LSTM + GloVe embeddings
  • Transformer-based models (ALBERT, DistilBERT)

Dataset

Source: IMDB 50K Movie Reviews Dataset

Size: 50,000 labeled reviews (balanced: 25k positive / 25k negative)

Target: Sentiment (binary classification)

Methodology

1. Exploratory Data Analysis (EDA)

  • Review length distribution (words & characters)
  • Stopword analysis
  • Word clouds for positive/negative reviews
  • Patterns: HTML tags, emojis, excessive punctuation, slang

2. Preprocessing

  • HTML/URL removal
  • Lowercasing & contraction expansion
  • Stopword removal & lemmatization
  • Tokenization (NLTK & HuggingFace)

3. Feature Engineering

  • CountVectorizer & TF-IDF for ML baseline
  • Word embeddings (GloVe 100D) for LSTM
  • Tokenizer + Padding for sequence models

4. Models Trained

  • Naive Bayes: Baseline with DTM & TF-IDF
  • LSTM (Bi-LSTM + GlobalMaxPool): trained with GloVe embeddings
  • Transformers: Fine-tuned ALBERT & DistilBERT (HuggingFace Trainer)

5. Evaluation

  • Metrics: Accuracy, Precision, Recall, F1
  • Confusion matrices plotted for each model

Results

  • Naive Bayes: Good baseline, limited accuracy with %86 macro average F1-score
  • LSTM + GloVe: Improved performance, better contextual capture with %87 macro average F1-score
  • DistilBERT / ALBERT: Best overall performance with %94 macro average F1-score