Course: BDA 622 -- Marketing Analytics | Mercer University | Spring 2024
Authors: Sahil Patel | Drashti Khatra
- Project Overview
- Business Problem
- Dataset
- Text Processing Pipeline
- Models Built
- Experimental Results
- Key Findings & Analysis
- Marketing Implications
- Limitations & Future Work
- Tools & Technologies
- References
This project develops a text mining and sentiment analysis pipeline using RapidMiner to classify movie reviews as positive or negative. Two k-Nearest Neighbor (k-NN) classification models were built with different parameter configurations and evaluated on accuracy, precision, recall, and F1 metrics using 10-fold cross-validation.
The project demonstrates the full text analytics workflow -- from raw unstructured text through preprocessing, feature extraction, model training, parameter optimization, and performance evaluation -- with direct application to marketing analytics use cases such as brand monitoring, customer feedback analysis, and competitive benchmarking.
Businesses generate and receive vast volumes of unstructured text data daily -- customer reviews, social media posts, survey responses, and support tickets. Manual analysis of this data at scale is not feasible.
The core question this project addresses:
Can a machine learning model accurately classify customer sentiment from raw text data, and how significantly does parameter tuning impact classification performance?
Understanding this has direct commercial value:
- Identify product quality issues before they escalate
- Monitor brand health across review platforms
- Detect emerging customer trends and risks in real time
- Personalize marketing messages based on sentiment signals
| Attribute | Detail |
|---|---|
| Source | Stanford Large Movie Review Dataset |
| Full dataset size | 50,000 labeled reviews |
| Working dataset | 1,000 reviews (500 positive, 500 negative) |
| Sampling method | Random stratified selection from training set |
| Label balance | Perfectly balanced -- 50% positive, 50% negative |
| Data format | Individual labeled text files |
Dataset Citation: Maas, A. et al. (2011). Learning Word Vectors for Sentiment Analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Link: https://ai.stanford.edu/~amaas/data/sentiment/
All text preprocessing was performed in RapidMiner using the following operators:
`Process Documents from Files` operator -- settings: extract text only, use file extension as type, system encoding
| Parameter | Model 1 | Model 2 |
|---|---|---|
| Mode | Non-letters | Linguistic tokens (English) |
`Transform Cases` operator -- all tokens converted to lowercase
`Stem (Snowball)` operator -- English language. Reduces words to their root form (e.g., "running" → "run")
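As a rough illustration of the tokenization and case-folding steps outside RapidMiner, the Python sketch below splits text on non-letter characters and lowercases the result. The function name `tokenize_non_letters` is made up for this example, and the regular expression only approximates the "non-letters" mode (it treats ASCII letters only). Stemming is left out because it would need an external library such as NLTK.

```python
import re


def tokenize_non_letters(text: str) -> list[str]:
    """Split on runs of non-letter characters and lowercase each token."""
    # Any run of characters other than A-Z / a-z acts as one separator.
    tokens = re.split(r"[^A-Za-z]+", text)
    # re.split can yield empty strings at the edges; drop them.
    return [token.lower() for token in tokens if token]


review = "Great movie! I'd watch it 2 more times."
print(tokenize_non_letters(review))
# ['great', 'movie', 'i', 'd', 'watch', 'it', 'more', 'times']
```

Note how the apostrophe splits "I'd" into two tokens and the digit disappears entirely; token boundaries like these are where a non-letter split and a linguistic tokenizer can be expected to differ.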
- TF-IDF (Term Frequency -- Inverse Document Frequency) word vectors
- Weights terms by how frequently they appear in a document relative to the entire corpus, reducing the influence of common words
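To make the weighting concrete, here is a minimal sketch of a textbook TF-IDF variant: term frequency is the count divided by document length, and inverse document frequency is the natural log of (number of documents / documents containing the term). This exact formula is an assumption made for illustration; RapidMiner's internal formula and normalization may differ.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    ["the", "film", "was", "great"],
    ["the", "film", "was", "awful"],
    ["the", "plot", "was", "great"],
]
weights = tf_idf(docs)
for term, weight in weights[1].items():
    print(f"{term}: {weight:.3f}")
```

In the second document, "the" and "was" appear in every document and receive a weight of zero, while "awful" appears only once in the corpus and receives the highest weight.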
| Parameter | Model 1 | Model 2 |
|---|---|---|
| Method | Percentual | Absolute |
| Lower bound | 3.0% | 3 documents |
| Upper bound | 30.0% | 30 documents |
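The two pruning methods keep quite different vocabularies. With 1,000 documents, percentual bounds of 3% to 30% correspond to terms found in 30 to 300 documents, whereas absolute bounds of 3 to 30 keep much rarer terms. The sketch below illustrates this on a synthetic corpus; the function `prune_terms`, the toy term counts, and the choice of inclusive bounds are all assumptions for the example, not RapidMiner's implementation.

```python
from collections import Counter


def prune_terms(docs, lower, upper, percentual):
    """Return the terms whose document frequency lies in [lower, upper].

    With percentual=True the bounds are percentages of the corpus size;
    otherwise they are absolute document counts.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    if percentual:
        lower = lower * n_docs / 100
        upper = upper * n_docs / 100
    return {term for term, df in doc_freq.items() if lower <= df <= upper}


# Synthetic corpus: each term appears in the first n of 1,000 documents.
spread = {"film": 400, "boring": 100, "masterpiece": 20, "zxq": 1}
corpus = [[term for term, n in spread.items() if i < n] for i in range(1000)]

print(prune_terms(corpus, 3.0, 30.0, percentual=True))   # Model 1 style bounds
print(prune_terms(corpus, 3, 30, percentual=False))      # Model 2 style bounds
```

On this toy corpus the percentual setting keeps only "boring" (100 documents), while the absolute setting keeps only "masterpiece" (20 documents).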
`Cross Validation` operator -- 10 folds, automatic sampling, parallel execution enabled for both models
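For readers unfamiliar with the procedure, the following sketch shows one way to build 10 stratified folds for a balanced 1,000-review dataset. It assumes that automatic sampling behaves like stratified sampling for a two-class label, which is not confirmed here; the function name `stratified_folds` and the fixed seed are illustrative.

```python
import random


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each example index to a fold, keeping class ratios similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets an equal share per class.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


labels = ["pos"] * 500 + ["neg"] * 500
folds = stratified_folds(labels)
for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
    # A model would be fit on `train` and scored on `test_fold` here.
    print(len(train), len(test_fold))
```

Each of the 10 rounds trains on 900 reviews and tests on the remaining 100, so every review is used for testing exactly once.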
Model 1 (Default):

| Parameter | Value |
|---|---|
| Algorithm | k-Nearest Neighbors (k-NN) |
| k (number of neighbors) | 5 |
| Distance measure | Cosine Similarity |
| Voting | Weighted |
| Tokenization mode | Non-letters |
| Pruning | Percentual |
Model 2 (Optimized):

| Parameter | Value |
|---|---|
| Algorithm | k-Nearest Neighbors (k-NN) |
| k (number of neighbors) | 10 |
| Distance measure | Euclidean Distance |
| Voting | Weighted |
| Tokenization mode | Linguistic tokens |
| Pruning | Absolute |
Why k-NN for text classification? k-NN is a non-parametric, instance-based algorithm that works directly on TF-IDF vectors, which makes it a simple baseline for high-dimensional text feature spaces. It classifies a document by a vote among its k nearest neighbors in the feature space; with weighted voting, as used in both models here, closer neighbors count for more than distant ones. There is no explicit training phase: the algorithm stores the training examples and does all of its work at prediction time.
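The idea can be sketched in a few lines of Python. This is a minimal illustration, assuming sparse vectors stored as `{term: weight}` dictionaries, cosine similarity as the measure, and votes weighted by similarity; the exact weighting scheme used by RapidMiner is not documented here, and the toy training data is invented for the example.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, training, k=5):
    """Predict a label from the k most similar (vector, label) examples."""
    scored = sorted(
        ((cosine_similarity(query, vector), label) for vector, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for similarity, label in scored[:k]:
        # Weighted voting: each neighbor contributes its similarity score.
        votes[label] += similarity
    return max(votes, key=votes.get)


training = [
    ({"great": 1.0, "fun": 0.5}, "pos"),
    ({"great": 0.8}, "pos"),
    ({"awful": 1.0}, "neg"),
    ({"boring": 1.0, "awful": 0.3}, "neg"),
    ({"fun": 1.0}, "pos"),
]
print(knn_predict({"awful": 0.9, "boring": 0.4}, training, k=3))
```

All of the computation happens inside `knn_predict`; nothing is fitted in advance, which is why prediction cost grows with the size of the stored training set.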
| Metric | Model 1 (Default) | Model 2 (Optimized) |
|---|---|---|
| Overall Accuracy | 77.90% | 83.50% |
| True Positives (TP) | 406 | 473 |
| True Negatives (TN) | 373 | 362 |
| False Positives (FP) | 127 | 138 |
| False Negatives (FN) | 94 | 27 |
| Metric | Model 1 | Model 2 | Change (percentage points) |
|---|---|---|---|
| Precision -- Positive | 76.17% | 77.41% | +1.24 |
| Recall -- Positive | 81.20% | 94.60% | +13.40 |
| Precision -- Negative | 79.87% | 93.06% | +13.19 |
| Recall -- Negative | 74.60% | 72.40% | -2.20 |
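The precision and recall figures follow directly from the confusion-matrix counts, as the sketch below shows. The function name `classification_metrics` is illustrative. For Model 1 the sketch uses 127 false positives and 94 false negatives, which is the assignment that reproduces the reported precision and recall given 500 reviews per class.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy plus per-class precision and recall, in percent."""
    total = tp + tn + fp + fn
    return {
        "accuracy": 100 * (tp + tn) / total,
        "precision_pos": 100 * tp / (tp + fp),
        "recall_pos": 100 * tp / (tp + fn),
        # For the negative class, the roles of FP and FN are swapped.
        "precision_neg": 100 * tn / (tn + fn),
        "recall_neg": 100 * tn / (tn + fp),
    }


model_1 = classification_metrics(tp=406, tn=373, fp=127, fn=94)
model_2 = classification_metrics(tp=473, tn=362, fp=138, fn=27)

for name in model_1:
    print(f"{name}: {model_1[name]:.2f}% -> {model_2[name]:.2f}%")
```

Because the dataset holds 500 reviews per class, TP + FN and TN + FP must each equal 500, which is a quick sanity check on any set of counts.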
Model 2 achieved 83.50% overall accuracy, a 5.6 percentage point improvement over Model 1's 77.90%. The two configurations differ in several settings at once (including pruning, which changed from percentual to absolute), so the contribution of each individual change was not isolated. The main parameter changes were:
- Increasing k from 5 to 10 -- more neighbors reduce sensitivity to noise
- Switching from cosine similarity to Euclidean distance -- part of the better-performing configuration on this dataset's TF-IDF feature space
- Switching to linguistic tokenization -- more semantically meaningful token boundaries compared to non-letter splitting
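Positive reviews misclassified as negative (false negatives for the positive class) dropped from 94 to 27, a 71.3% reduction; this is what lifts positive recall from 81.20% to 94.60% on the 500 positive reviews. Missed negative sentiment is a different error: a negative review labeled as positive counts as a false positive here, and those rose slightly, from 127 to 138. In marketing contexts, missed negative sentiment is often the more costly error because undetected customer dissatisfaction can escalate unaddressed. Model 2's main advantage for brand monitoring is therefore that its negative flags are far more reliable (negative precision of 93.06% versus 79.87%), not that it catches more complaints.
Model 2 improved positive recall from 81.20% to 94.60%, while false positives rose from 127 to 138 and negative recall slipped from 74.60% to 72.40%. This is an expected trade-off in classification, and the right balance depends on the business priority:
- If missing negative sentiment is costly (e.g., product safety, quality issues): prioritize recall on the negative class, where the two models are close (74.60% vs. 72.40%) and further tuning would be needed
- If false alarms are costly (e.g., unnecessary escalations): prioritize precision on the negative class, where Model 2 is the better choice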
Using TF-IDF rather than raw term frequency downweights common words that appear across all documents (e.g., "the", "and") and upweights discriminative terms specific to positive or negative sentiment, improving classifier performance.
The techniques demonstrated in this project have direct applications in marketing:
| Application | How Sentiment Analysis Helps |
|---|---|
| Brand monitoring | Track positive/negative sentiment trends across review platforms over time |
| Product quality | Detect spikes in negative sentiment around specific product features |
| Customer experience | Identify recurring complaint themes before they escalate |
| Competitive benchmarking | Compare sentiment scores against competitor products |
| Campaign measurement | Measure sentiment shift before and after marketing campaigns |
| Personalization | Segment customers by sentiment profile for targeted messaging |
- Dataset scope: Only movie reviews -- sentiment patterns may differ across domains (e.g., automotive, financial services, consumer products)
- Binary classification only: Positive/negative labels; no neutral category or intensity scoring
- Tool dependency: RapidMiner workflow is not easily reproducible without the software license
- Static model: No retraining mechanism for concept drift over time
- Rebuild pipeline in Python using `scikit-learn`, `NLTK`, or `spaCy` for full reproducibility
- Extend to multi-class sentiment (positive, neutral, negative, mixed)
- Experiment with more advanced models: Naive Bayes, SVM, BERT
- Apply domain-specific sentiment lexicons for non-movie-review text
- Integrate real-time social media data via API for live brand monitoring
- Add topic modeling (LDA) to identify themes within sentiment clusters
| Tool / Technology | Purpose |
|---|---|
| RapidMiner | End-to-end text mining and ML workflow |
| TF-IDF Vectorization | Text feature extraction |
| Snowball Stemmer | NLP preprocessing -- stemming |
| k-NN Classification | Supervised sentiment classification |
| 10-fold Cross-Validation | Model evaluation and generalization testing |
| Stanford Movie Review Dataset | Benchmark NLP dataset |
| File | Description |
|---|---|
| `Final-Report.docx` | Full written project report with methodology and results |
| `Marketing-Analytics-Presentation.pptx` | Class presentation slides |
| `BUS622-Final-Project.pdf` | Original project brief and requirements |
| `Project-Rubric.pdf` | Grading rubric (written report + presentation) |
Maas, A., Daly, R., Pham, P., Huang, D., Ng, A., & Potts, C. (2011). Learning Word Vectors for Sentiment Analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, 142--150. https://ai.stanford.edu/~amaas/data/sentiment/
Lilien, G. L., Rangaswamy, A., & de Bruyn, A. (2017). Principles of Marketing Engineering and Analytics (3rd ed.). DecisionPro, Inc.
Completed as part of BDA 622 -- Marketing Analytics, Mercer University, Stetson-Hatcher School of Business, Spring 2024.