A Python-based toolkit for analyzing and classifying tweet sentiments using Natural Language Processing (NLP) and Machine Learning techniques. This project provides tools for text cleaning, tokenization, vectorization, and sentiment classification of tweets into positive, negative, or neutral categories.
For the fastest way to explore results and run the complete analysis, open the Jupyter notebooks:
| Notebook | Description |
|---|---|
| tweets.ipynb | Main analysis notebook — complete pipeline with outputs |
| tweets_analysis.ipynb | Additional exploratory analysis |
| tweets_similarity.ipynb | Tweet similarity comparisons |
```bash
jupyter notebook tweets.ipynb
```

## Features

- Text Preprocessing: Clean tweets by removing mentions, URLs, hashtags, and punctuation
- Multiple Tokenization Methods: Support for various stemming (Porter, Snowball, Lancaster) and lemmatization approaches
- Spell Correction: Automatic misspelling correction using SymSpell
- Vectorization Options: Count Vectorizer, Binary Vectorizer, and TF-IDF Vectorizer
- Machine Learning Classification: Train sentiment classifiers using Random Forest
- Interactive Notebooks: Jupyter notebooks for exploratory analysis and experimentation
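As a rough illustration of the preprocessing step, a minimal cleaning function along these lines could look as follows (a sketch only; the project's actual `cleaning.py` may differ in regexes and ordering):

```python
import re

def clean_tweet(text: str) -> str:
    """Remove mentions, URLs, hashtags, and punctuation from a tweet."""
    text = re.sub(r"@\w+", "", text)          # mentions
    text = re.sub(r"https?://\S+", "", text)  # URLs
    text = re.sub(r"#\w+", "", text)          # hashtags
    text = re.sub(r"[^\w\s]", "", text)       # punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(clean_tweet("@user I love this! https://t.co/xyz #happy"))  # → "I love this"
```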
The project includes pre-processed tweet datasets:
- `./data/processedNegative.csv` - Tweets labeled as negative sentiment
- `./data/processedPositive.csv` - Tweets labeled as positive sentiment
- `./data/processedNeutral.csv` - Tweets labeled as neutral sentiment
| Module | Description |
|---|---|
| `config.py` | Load and merge tweet datasets into a unified DataFrame |
| `cleaning.py` | Text preprocessing (remove mentions, URLs, hashtags, punctuation) |
| `tokenizer.py` | Tokenization with stemming, lemmatization, and spell correction |
| `vectorizer.py` | Convert text to numerical features (BoW, Binary, TF-IDF) |
| `train.py` | Train and evaluate machine learning classifiers |
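To give a feel for how a custom tokenizer plugs in, here is a minimal Porter-stemming tokenizer in the spirit of `tokenizer.py`. This is illustrative only: the real module also offers Snowball and Lancaster stemming, lemmatization, and SymSpell correction, and may tokenize differently than a plain whitespace split:

```python
from nltk.stem import PorterStemmer

def porter_stem_tokens(text):
    """Lowercase, whitespace-tokenize, and Porter-stem each token."""
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in text.lower().split()]

print(porter_stem_tokens("Running through fields"))
```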
- Python 3.8 or higher
- pip (Python package manager)
- pandas
- scikit-learn
- nltk
- symspellpy
## Installation

1. Clone the repository

   ```bash
   git clone https://github.com/mylastresort/tweets
   cd tweets
   ```

2. Create a virtual environment (recommended)

   ```bash
   python -m venv venv
   source venv/bin/activate   # On Linux/macOS
   # or: venv\Scripts\activate  # On Windows
   ```

3. Install dependencies

   ```bash
   pip install pandas scikit-learn nltk symspellpy
   ```

4. Download NLTK data

   ```bash
   python -c "import nltk; nltk.download('punkt_tab'); nltk.download('stopwords'); nltk.download('wordnet')"
   ```

Alternatively, run the main module:

```bash
python main.py
```
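For orientation, the vectorize-then-classify pipeline the toolkit wraps can be sketched with scikit-learn alone. The toy tweets, labels, and parameters below are illustrative, not the project's data or defaults:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy labeled tweets (illustrative only)
tweets = ["love this so much", "great day today", "terrible awful service", "worst day ever"]
labels = ["positive", "positive", "negative", "negative"]

# Vectorize: each tweet becomes a TF-IDF weighted term vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tweets)

# Train a Random Forest sentiment classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, labels)

# Classify a new tweet
print(model.predict(vectorizer.transform(["love this day"])))
```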
```python
from config import get_merged_dataframe
from cleaning import clean_tweets
from vectorizer import count_vectorize, tfidf_vectorize
from train import train_model, evaluate_model

# Load and merge datasets
df = get_merged_dataframe(
    './data/processedNegative.csv',
    './data/processedPositive.csv',
    './data/processedNeutral.csv'
)

# Vectorize tweets
vectorizer, bow_matrix = count_vectorize(df, 'tweet')

# Train a classifier (first split features and labels into train/test
# sets, e.g. with sklearn.model_selection.train_test_split)
model = train_model(X_train, y_train)

# Evaluate the model
accuracy = evaluate_model(model, X_test, y_test)
```

Vectorization options:

```python
from vectorizer import count_vectorize, binary_vectorize, tfidf_vectorize

# Count Vectorizer (word frequencies)
vectorizer, matrix = count_vectorize(df, 'tweet')

# Binary Vectorizer (word presence/absence)
vectorizer, matrix = binary_vectorize(df, 'tweet')

# TF-IDF Vectorizer (term frequency-inverse document frequency)
vectorizer, matrix = tfidf_vectorize(df, 'tweet')
```

Custom tokenizers:

```python
from tokenizer import stem_tokens, lemmatize_tokens, misspell_and_lemmatize_tokens
from vectorizer import count_vectorize

# With stemming
vectorizer, matrix = count_vectorize(df, 'tweet', tokenizer=stem_tokens)

# With lemmatization
vectorizer, matrix = count_vectorize(df, 'tweet', tokenizer=lemmatize_tokens)

# With spell correction + lemmatization
vectorizer, matrix = count_vectorize(df, 'tweet', tokenizer=misspell_and_lemmatize_tokens)
```

## Contributing

Contributions are welcome! To contribute:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/new-feature`)
3. Commit your changes (`git commit -am 'Add new feature'`)
4. Push to the branch (`git push origin feature/new-feature`)
5. Open a Pull Request
This project uses the following open-source libraries:
- NLTK - Natural Language Toolkit
- scikit-learn - Machine Learning in Python
- SymSpell - Spelling correction
- pandas - Data manipulation and analysis