A Python-based toolkit for analyzing and classifying tweet sentiments using Natural Language Processing (NLP) and Machine Learning techniques. This project provides tools for text cleaning, tokenization, vectorization, and sentiment classification of tweets into positive, negative, or neutral categories.
For the fastest way to explore results and run the complete analysis, open the Jupyter notebooks:
| Notebook | Description |
|---|---|
| tweets.ipynb | Main analysis notebook — complete pipeline with outputs |
| tweets_analysis.ipynb | Additional exploratory analysis |
| tweets_similarity.ipynb | Tweet similarity comparisons |
```bash
jupyter notebook tweets.ipynb
```

## Features

- Text Preprocessing: Clean tweets by removing mentions, URLs, hashtags, and punctuation
- Multiple Tokenization Methods: Support for various stemming (Porter, Snowball, Lancaster) and lemmatization approaches
- Spell Correction: Automatic misspelling correction using SymSpell
- Vectorization Options: Count Vectorizer, Binary Vectorizer, and TF-IDF Vectorizer
- Machine Learning Classification: Train sentiment classifiers using Random Forest
- Interactive Notebooks: Jupyter notebooks for exploratory analysis and experimentation
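As a rough illustration of the preprocessing step, a minimal cleaning function along these lines could look as follows (a sketch only; the project's actual `cleaning.py` may differ in regexes and ordering):

```python
import re

def clean_tweet(text: str) -> str:
    """Remove mentions, URLs, hashtags, and punctuation from a tweet."""
    text = re.sub(r"@\w+", "", text)          # mentions
    text = re.sub(r"https?://\S+", "", text)  # URLs
    text = re.sub(r"#\w+", "", text)          # hashtags
    text = re.sub(r"[^\w\s]", "", text)       # punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(clean_tweet("@user I love this! https://t.co/xyz #happy"))  # → "I love this"
```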
The project includes pre-processed tweet datasets:
- `./data/processedNegative.csv` - Tweets labeled as negative sentiment
- `./data/processedPositive.csv` - Tweets labeled as positive sentiment
- `./data/processedNeutral.csv` - Tweets labeled as neutral sentiment
| Module | Description |
|---|---|
| `config.py` | Load and merge tweet datasets into a unified DataFrame |
| `cleaning.py` | Text preprocessing (remove mentions, URLs, hashtags, punctuation) |
| `tokenizer.py` | Tokenization with stemming, lemmatization, and spell correction |
| `vectorizer.py` | Convert text to numerical features (BoW, Binary, TF-IDF) |
| `train.py` | Train and evaluate machine learning classifiers |
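To give a feel for how a custom tokenizer plugs in, here is a minimal Porter-stemming tokenizer in the spirit of `tokenizer.py`. This is illustrative only: the real module also offers Snowball and Lancaster stemming, lemmatization, and SymSpell correction, and may tokenize differently than a plain whitespace split:

```python
from nltk.stem import PorterStemmer

def porter_stem_tokens(text):
    """Lowercase, whitespace-tokenize, and Porter-stem each token."""
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in text.lower().split()]

print(porter_stem_tokens("Running through fields"))
```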
- Python 3.8 or higher
- pip (Python package manager)
- pandas
- scikit-learn
- nltk
- symspellpy
## Installation

1. Clone the repository

   ```bash
   git clone https://github.com/mylastresort/tweets
   cd tweets
   ```

2. Create a virtual environment (recommended)

   ```bash
   python -m venv venv
   source venv/bin/activate   # On Linux/macOS
   # or: venv\Scripts\activate  # On Windows
   ```

3. Install dependencies

   ```bash
   pip install pandas scikit-learn nltk symspellpy
   ```

4. Download NLTK data

   ```bash
   python -c "import nltk; nltk.download('punkt_tab'); nltk.download('stopwords'); nltk.download('wordnet')"
   ```

Alternatively, run the main module:

```bash
python main.py
```
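For orientation, the vectorize-then-classify pipeline the toolkit wraps can be sketched with scikit-learn alone. The toy tweets, labels, and parameters below are illustrative, not the project's data or defaults:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy labeled tweets (illustrative only)
tweets = ["love this so much", "great day today", "terrible awful service", "worst day ever"]
labels = ["positive", "positive", "negative", "negative"]

# Vectorize: each tweet becomes a TF-IDF weighted term vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tweets)

# Train a Random Forest sentiment classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, labels)

# Classify a new tweet
print(model.predict(vectorizer.transform(["love this day"])))
```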
```python
from config import get_merged_dataframe
from cleaning import clean_tweets
from vectorizer import count_vectorize, tfidf_vectorize
from train import train_model, evaluate_model

# Load and merge datasets
df = get_merged_dataframe(
    './data/processedNegative.csv',
    './data/processedPositive.csv',
    './data/processedNeutral.csv'
)

# Vectorize tweets
vectorizer, bow_matrix = count_vectorize(df, 'tweet')

# Train a classifier (first split features and labels into train/test
# sets, e.g. with sklearn.model_selection.train_test_split)
model = train_model(X_train, y_train)

# Evaluate the model
accuracy = evaluate_model(model, X_test, y_test)
```

Vectorization options:

```python
from vectorizer import count_vectorize, binary_vectorize, tfidf_vectorize

# Count Vectorizer (word frequencies)
vectorizer, matrix = count_vectorize(df, 'tweet')

# Binary Vectorizer (word presence/absence)
vectorizer, matrix = binary_vectorize(df, 'tweet')

# TF-IDF Vectorizer (term frequency-inverse document frequency)
vectorizer, matrix = tfidf_vectorize(df, 'tweet')
```

Custom tokenizers:

```python
from tokenizer import stem_tokens, lemmatize_tokens, misspell_and_lemmatize_tokens
from vectorizer import count_vectorize

# With stemming
vectorizer, matrix = count_vectorize(df, 'tweet', tokenizer=stem_tokens)

# With lemmatization
vectorizer, matrix = count_vectorize(df, 'tweet', tokenizer=lemmatize_tokens)

# With spell correction + lemmatization
vectorizer, matrix = count_vectorize(df, 'tweet', tokenizer=misspell_and_lemmatize_tokens)
```

## Contributing

Contributions are welcome! To contribute:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/new-feature`)
3. Commit your changes (`git commit -am 'Add new feature'`)
4. Push to the branch (`git push origin feature/new-feature`)
5. Open a Pull Request
This project uses the following open-source libraries:
- NLTK - Natural Language Toolkit
- scikit-learn - Machine Learning in Python
- SymSpell - Spelling correction
- pandas - Data manipulation and analysis