IMDb Sentiment Analysis — Contextual Word Classification - Shanghai Jiao Tong University/Mines Paris
Authors: Wael Ben Slima & Marko Babic
Notebook: Wael_Marko_IMDB_Sentiment_Analysis.ipynb
Dataset: IMDb Dataset of 50K Movie Reviews
This notebook implements a contextual word sentiment classification model using the IMDb movie review dataset.
The primary goal is to classify individual words as positive, negative, or neutral by leveraging sentence-level sentiment labels and the context of surrounding words.
For example:
- “beautiful” → Positive
- “defeat” → Negative
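One way to picture the word-level step is sketched below. It is an illustration under stated assumptions, not the notebook's actual method: the helper names (`context_window`, `label_from_score`, `classify_word`), the window size, and the 0.4 / 0.6 thresholds are all hypothetical, and `toy_score` is only a stand-in for a model trained on sentence-level labels.

```python
from typing import Callable, List


def context_window(tokens: List[str], index: int, window: int = 2) -> List[str]:
    """Return the target token plus up to `window` neighbours on each side."""
    start = max(0, index - window)
    return tokens[start:index + window + 1]


def label_from_score(p_positive: float, low: float = 0.4, high: float = 0.6) -> str:
    """Map a positive-class probability to a word-level label (assumed thresholds)."""
    if p_positive >= high:
        return "positive"
    if p_positive <= low:
        return "negative"
    return "neutral"


def classify_word(tokens: List[str], index: int,
                  score_fn: Callable[[List[str]], float], window: int = 2) -> str:
    """Score the word's context with a sentence-level scorer, then threshold it."""
    return label_from_score(score_fn(context_window(tokens, index, window)))


def toy_score(tokens: List[str]) -> float:
    """Hypothetical stand-in for the trained model; NOT a real sentiment scorer."""
    positive, negative = {"beautiful", "great"}, {"defeat", "awful"}
    hits = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return min(1.0, max(0.0, 0.5 + 0.25 * hits))


tokens = ["a", "beautiful", "film", "about", "defeat"]
print(classify_word(tokens, 1, toy_score, window=1))  # context: a beautiful film
print(classify_word(tokens, 4, toy_score, window=1))  # context: about defeat
```

In this sketch a neutral word such as "film" inherits the polarity of its neighbours, which is the sense in which the classification is contextual.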
The IMDb dataset contains 50,000 movie reviews, split into:
- 25,000 for training
- 25,000 for testing
Each review is labeled as either positive or negative.
- Load data using Pandas.
- Clean the text: remove HTML tags and punctuation, lowercase, strip numbers, and remove stopwords.
- Tokenize and pad sequences for model input.
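The cleaning, tokenizing, and padding steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the stopword set is a tiny placeholder, the function names are hypothetical, and the choices of reserving index 0 for padding, index 1 for unknown words, and padding at the end of the sequence are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; a real run would use a fuller list.
STOPWORDS = {"a", "an", "the", "is", "was", "and", "of", "this", "it"}


def clean_text(text):
    """Strip HTML tags, lowercase, drop punctuation and digits, remove stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)   # HTML tags such as <br />
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation and numbers
    return [tok for tok in text.split() if tok not in STOPWORDS]


def build_vocab(token_lists, max_words=10000):
    """Map the most frequent words to integer ids (0 = padding, 1 = unknown)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(max_words))}


def encode_and_pad(tokens, vocab, max_len):
    """Convert tokens to ids, truncate to max_len, and right-pad with zeros."""
    ids = [vocab.get(tok, 1) for tok in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


reviews = ["This movie was <br />GREAT, 10/10!", "It was great."]
cleaned = [clean_text(r) for r in reviews]
vocab = build_vocab(cleaned)
padded = [encode_and_pad(toks, vocab, max_len=5) for toks in cleaned]
print(cleaned)
print(padded)
```

Fixed-length integer sequences are what an embedding layer expects as input, which is why padding follows tokenization.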
- Build the model with TensorFlow / Keras.
- Architecture includes:
  - Embedding layer
  - (Bi)LSTM layer to capture contextual dependencies
  - Dense output layers for classification
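A model with the layers listed above might look like the following Keras sketch. The layer sizes, vocabulary size, sequence length, optimizer, and the single sigmoid output (chosen because the review labels are binary) are assumptions for illustration and are not taken from the notebook.

```python
from tensorflow.keras import layers, models


def build_model(vocab_size=20000, max_len=200, embed_dim=64, lstm_units=64):
    """Embedding -> bidirectional LSTM -> dense layers, for binary sentiment."""
    model = models.Sequential([
        layers.Input(shape=(max_len,)),                 # padded sequences of word ids
        layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
        layers.Bidirectional(layers.LSTM(lstm_units)),  # reads context in both directions
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),          # probability of a positive review
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


model = build_model()
model.summary()
```

The bidirectional wrapper lets each position draw on words both before and after it, which matches the stated goal of capturing contextual dependencies.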
- Train on sentence-level labels.
- Use callbacks (e.g. ModelCheckpoint) to save the best model.
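Saving the best model during training can be sketched with Keras's `ModelCheckpoint` callback. The file name, the monitored metric (`val_loss`), and the `fit` arguments shown in the comment are assumptions; `make_callbacks` is a hypothetical helper, and `x_train` / `y_train` are placeholders for the padded sequences and sentence-level labels.

```python
from tensorflow.keras.callbacks import ModelCheckpoint


def make_callbacks(checkpoint_path="best_model.keras"):
    """Return callbacks that keep only the model with the lowest validation loss."""
    return [
        ModelCheckpoint(checkpoint_path, monitor="val_loss", save_best_only=True),
    ]


callbacks = make_callbacks()

# Assumed usage with a compiled model and prepared data:
# history = model.fit(x_train, y_train, validation_split=0.2,
#                     epochs=5, batch_size=64, callbacks=callbacks)
```

With `save_best_only=True`, the checkpoint file is overwritten only when the monitored metric improves, so the file on disk holds the best epoch and not necessarily the last one.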
- Plot accuracy and loss curves.
- Compute the confusion matrix and classification metrics.
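The evaluation metrics above can be computed by hand for the binary case, as in the dependency-free sketch below. The notebook may well rely on a library for this; the function names and the choice of accuracy, precision, recall, and F1 as the reported metrics are assumptions.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means a positive review."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1:
            fn += 1
        elif p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]  # e.g. model probabilities thresholded at 0.5
print(confusion_matrix(y_true, y_pred))
print(classification_metrics(y_true, y_pred))
```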