Project Status: Completed (academic project 2024-2025)
The goal of CAPITOL is to classify U.S. political tweets based on their content, distinguishing between Democratic and Republican affiliations. This project explores the entire NLP pipeline: from preprocessing raw social media text to comparing traditional Machine Learning baselines against Deep Learning architectures (RNNs/LSTMs).
The study highlights how different feature representations (sparse TF-IDF vectors vs. dense word embeddings) impact classification performance on a balanced dataset of 30,000 tweets.
- Language: Python
- Libraries: TensorFlow/Keras, Scikit-Learn, NLTK, Gensim, XGBoost, Pandas, NumPy, Matplotlib/Seaborn.
- Techniques: TF-IDF, word embeddings (GloVe, FastText), LSTM, Logistic Regression.
We implemented a robust cleaning pipeline to handle social media noise:
- Hashtag decomposition: using 'wordninja' to split tags like "#BuildBackBetter" into "Build Back Better".
- Cleaning: removal of emojis, URLs, mentions and non-ASCII characters.
- Tokenization and padding: standardized input length (31 tokens) for neural networks.
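The cleaning steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code: the regular expressions, the `<pad>` token, the whitespace tokenizer, and the function names are all assumptions, and the CamelCase splitter is only a stand-in for `wordninja`, which also handles all-lowercase tags.

```python
import re

MAX_LEN = 31   # padded sequence length stated above
PAD = "<pad>"  # hypothetical padding token

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
# Splits CamelCase runs, acronyms, and digit groups.
CAMEL_RE = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_hashtag(match):
    # Stand-in for wordninja: only CamelCase tags are split correctly.
    return " ".join(CAMEL_RE.findall(match.group(1)))


def clean_tweet(text):
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(split_hashtag, text)
    # Dropping non-ASCII characters also removes emojis.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    return " ".join(text.split())


def tokenize_and_pad(text, max_len=MAX_LEN):
    # Whitespace tokenization is an assumption made for brevity.
    tokens = clean_tweet(text).split()[:max_len]
    return tokens + [PAD] * (max_len - len(tokens))


cleaned = clean_tweet("We will #BuildBackBetter 🇺🇸 @POTUS https://t.co/abc")
padded = tokenize_and_pad("We will #BuildBackBetter 🇺🇸 @POTUS https://t.co/abc")
```

Truncating and right-padding to a fixed length keeps every input the same shape, which is what the neural models expect.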
We compared two main approaches:
- Sparse: TF-IDF vectorization (bag-of-words approach)
- Dense:
  - Pre-trained GloVe (glove.6B.100d): trained on Wikipedia 2014 and Gigaword 5.
  - Custom FastText: trained from scratch on our political dataset to capture domain-specific slang (e.g. "libtard", "trumper").
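To illustrate why domain coverage matters for dense embeddings, here is a small sketch of how a pre-trained vector file is typically turned into an embedding matrix. It assumes GloVe's plain-text format (a token followed by its vector components on each line); the toy vectors, the 3-dimensional size, the vocabulary, and the function names are invented for the example and are not taken from the project.

```python
import io

import numpy as np


def load_glove(handle):
    """Parse GloVe's text format: one token per line, followed by its vector."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    # Row 0 is reserved for padding; words without a vector stay all-zero.
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = vec
    return matrix


# Tiny 3-dimensional stand-in for a real GloVe file (values are made up).
toy_glove = io.StringIO("tax 0.1 0.2 0.3\nvote 0.4 0.5 0.6\n")
vectors = load_glove(toy_glove)
word_index = {"tax": 1, "vote": 2, "trumper": 3}  # hypothetical vocabulary
embedding_matrix = build_embedding_matrix(word_index, vectors, dim=3)
```

In this toy setup the slang term has no pre-trained vector, so its row stays at zero. Embeddings trained on the tweets themselves avoid that gap, at the cost of learning from far less text.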
We trained and evaluated four distinct configurations:
- Logistic regression + TF-IDF (baseline)
- XGBoost + TF-IDF
- LSTM (an RNN) + pre-trained GloVe
- LSTM (an RNN) + custom FastText
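As a rough idea of what the baseline configuration looks like, the sketch below wires TF-IDF and logistic regression together with scikit-learn. The example sentences, the label strings, and the hyperparameters (`ngram_range`, `max_iter`) are assumptions chosen for illustration; the settings actually used in the project are not stated here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy examples invented for illustration; not taken from the dataset.
texts = [
    "expand healthcare coverage and protect voting rights",
    "climate action and clean energy jobs now",
    "cut taxes and secure the border",
    "defend the second amendment and lower taxes",
]
labels = ["Democrat", "Democrat", "Republican", "Republican"]

# Vectorizer and classifier are fitted together, so the TF-IDF vocabulary
# is learned only from the training texts.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)
predictions = baseline.predict(["lower taxes and secure the border"])
```

Wrapping both steps in one pipeline makes it harder to leak test-set vocabulary into the vectorizer by accident.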
The evaluation was performed on a held-out test set using accuracy and F1-score.
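For reference, both metrics can be computed by hand as in the sketch below. The labels and predictions are invented, and the helper names are hypothetical; in practice a library such as Scikit-Learn provides equivalent functions.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive):
    # Binary F1 for the class named by `positive`.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: "D" and "R" stand for the two classes.
y_true = ["D", "D", "R", "R", "R"]
y_pred = ["D", "R", "R", "R", "D"]

acc = accuracy(y_true, y_pred)
f1_r = f1_score(y_true, y_pred, positive="R")
```

On a balanced dataset accuracy and F1 tend to move together, but F1 still exposes a model that trades precision for recall on one class.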
Key insights: while the deep learning model with custom FastText embeddings performed competitively (0.82), the simpler logistic regression achieved the best results (0.84). This suggests that specific keywords (captured well by TF-IDF) are highly discriminative markers for political affiliation in this dataset.
Throughout the project, we faced limited computational resources on Deepnote, which led us to reduce the dataset to 30,000 samples to ensure manageable training times. Library compatibility issues also slowed down model experimentation. Additionally, downloading large pre-trained models like GloVe increased setup time and disk usage within the constrained cloud environment.
In summary, our project involved several key steps: preprocessing political tweet data, applying various vectorization and embedding techniques, implementing both classical and deep learning models, and evaluating their performance. This hands-on experience enhanced our understanding of NLP workflows and highlighted how different textual representations and algorithms influence classification outcomes.
Although the performance varied across models, all approaches demonstrated the ability to identify political affiliation in text with reasonable accuracy. Logistic Regression with TF-IDF stood out for its simplicity and effectiveness. Meanwhile, LSTM models with embeddings offered strong performance while also capturing sequential patterns in language. The project provided us with practical experience in implementing and evaluating a real-world NLP task using multiple modeling techniques.
For a detailed analysis, please refer to the full project report.
If you have suggestions for improvements, new experiments, or alternative embeddings to try, feel free to open an issue or submit a pull request. Let’s continue improving this project together!
- Developed by Tsuhuy Ecaterina and Brusati Lorenzo for the "Text Mining and Natural Language Processing" course.
- The dataset used for this project was provided by Jacobvs (https://huggingface.co/datasets/Jacobvs/PoliticalTweets) via HuggingFace.
- Pre-trained Embeddings:
  - GloVe: we used the 'glove.6B.100d' pre-trained vectors provided by Stanford NLP (https://nlp.stanford.edu/projects/glove/).
  - FastText: the FastText model architecture used for custom training was developed by Facebook AI Research (FAIR) (https://fasttext.cc/).