# Sentiment Analysis of Flood Risk Perception in Greater Accra using X/Twitter Data

## 📖 Project Overview

This project performs Sentiment Analysis on X/Twitter data to gauge public perception and emotional tone regarding flood risk in Ghana's Greater Accra Region. By scraping and analyzing tweets containing flood-related keywords, this study categorizes public sentiment into Positive, Negative, and Neutral classes. The goal is to provide insights into community concerns, government response perceptions, and overall social vulnerability, complementing traditional geospatial flood risk models with human-centric data.

## 🗺️ Study Area Context

The analysis focuses on public discourse surrounding flooding in the Greater Accra Region, Ghana's capital and most densely populated area. This region is highly vulnerable to flooding due to its:

- Low-lying coastal plains
- Rapid urbanization
- Inadequate drainage infrastructure
- Frequent heavy rainfall events

Understanding public sentiment here is crucial for effective disaster communication, policy-making, and community engagement strategies.

## 📊 Data Collection

- **Source:** X (formerly Twitter)
- **Method:** Data was scraped using third-party tools (Twitter Scraper & Twitter Scraper V2) on Apify.com.
- **Keywords:** Tweets were collected using flood-related buzzwords and specific location names (e.g., "Weija flood", "Adabraka flood", "Tse Addo flood", "Kaneshie flood", "Odaw drain").
- **Dataset:** 1,232 publicly accessible user tweets were collected and downloaded in CSV format.
- **Data Fields:** The dataset includes:
  - tweet (text content)
  - author username, location, description
  - date of tweet
  - engagement metrics (likes count, retweets)
  - author account details (followers, following, account creation date)

## ⚙️ Methodology & Processing

### 1. Data Preprocessing

Raw tweet data contains noise that must be cleaned for accurate analysis. The following preprocessing steps were implemented in Python using libraries like pandas, nltk, and re:

- **Removal of Irrelevant Columns:** Focus was placed solely on the tweet text (tweet column).
- **Text Cleaning:** A custom function was defined to:
  - Convert text to lowercase.
  - Remove URLs, user mentions (@), and hashtags (#).
  - Remove punctuation and non-word characters.
  - Tokenize text (split into individual words).
  - Remove stop words (e.g., "the", "and", "is") using the NLTK corpus.
- **Stemming:** Words were reduced to their root form using the Porter Stemmer algorithm (e.g., "flooding" -> "flood") to normalize the vocabulary.
- **Deduplication:** Duplicate tweets were removed to prevent skewing the results.
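
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual function: the names `clean_tweet` and `clean_and_deduplicate` are hypothetical, the stop word set is a tiny stand-in for the NLTK list, hashtags are assumed to be removed as whole tokens, and deduplication is assumed to happen after cleaning. Stemming is passed in as a callable so the sketch runs with the standard library alone.

```python
import re
from typing import Callable, Iterable, List

# Tiny illustrative stop word set (assumption); the project uses the NLTK corpus.
STOP_WORDS = {"the", "and", "is", "a", "an", "in", "of", "to", "at", "on"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")  # assumption: drop the whole token
NON_WORD_RE = re.compile(r"[^\w\s]")


def _identity(word: str) -> str:
    """Default 'stemmer' that leaves words unchanged."""
    return word


def clean_tweet(text: str, stem: Callable[[str], str] = _identity) -> str:
    """Lowercase, strip URLs/mentions/hashtags/punctuation, drop stop words, stem."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_HASHTAG_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    tokens = text.split()  # simple whitespace tokenization
    return " ".join(stem(tok) for tok in tokens if tok not in STOP_WORDS)


def clean_and_deduplicate(
    tweets: Iterable[str], stem: Callable[[str], str] = _identity
) -> List[str]:
    """Clean every tweet, then keep the first occurrence of each cleaned text."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        text = clean_tweet(tweet, stem)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

To apply Porter stemming as described above, `nltk.stem.PorterStemmer().stem` can be passed as the `stem` argument.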

### 2. Sentiment Classification

- **Tool:** The TextBlob library was used for initial sentiment polarity scoring.
- **Scoring:** Each tweet was assigned a polarity score between -1 (Very Negative) and +1 (Very Positive).
- **Categorization:** Tweets were classified into three categories based on their polarity score:
  - Negative: polarity < 0
  - Neutral: polarity = 0
  - Positive: polarity > 0
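
The thresholding rule above is simple enough to write down directly. The sketch below uses hypothetical function names (`label_from_polarity`, `classify_tweet`); the TextBlob call is imported lazily so the labelling rule can be read and tested on its own.

```python
def label_from_polarity(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to one of the three sentiment classes."""
    if polarity < 0:
        return "Negative"
    if polarity > 0:
        return "Positive"
    return "Neutral"


def classify_tweet(text: str) -> str:
    """Score a tweet with TextBlob and return its sentiment label."""
    # Imported here so label_from_polarity works without TextBlob installed.
    from textblob import TextBlob

    return label_from_polarity(TextBlob(text).sentiment.polarity)
```

Because Neutral requires a polarity of exactly 0, it mostly collects tweets in which TextBlob finds no sentiment-bearing words; any small non-zero score lands in Positive or Negative.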

### 3. Machine Learning Modelling

To improve classification robustness, two machine learning models were trained and evaluated.

- **Feature Extraction:** The cleaned text was vectorized using CountVectorizer, which transforms text into a matrix of token counts.
- **Models Used:**
  1. Logistic Regression
  2. Linear Support Vector Classifier (Linear SVC)
- **Training:** The dataset was split into a training set (80%) and a testing set (20%).
- **Hyperparameter Tuning:** GridSearchCV was used to find the optimal parameters for each model.
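
One way the modelling steps above might fit together in scikit-learn is sketched below. Several details are assumptions rather than the project's settings: the toy texts and labels are made up for illustration (they are not from the dataset), the parameter grid over `C`, the 2-fold cross-validation, the stratified split, and `random_state=42` are all placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up, already-cleaned toy examples (NOT from the project dataset).
texts = [
    "terrible flood destroy home", "awful drain block again", "bad flood road ruin",
    "worst flood kaneshie market", "horrible water everywhere", "angry flood damage shop",
    "rain fall accra today", "flood report weija area", "water level odaw drain",
    "rain forecast tomorrow", "road close adabraka", "drain work schedule monday",
    "great plan fix drain", "good news new drain", "happy government desilt odaw",
    "excellent response flood team", "best effort clean gutter", "nice work drain project",
]
labels = ["Negative"] * 6 + ["Neutral"] * 6 + ["Positive"] * 6

# 80/20 split as described above; stratification is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Hypothetical search grids over the regularization strength C.
candidates = {
    "Logistic Regression": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1, 10]}),
    "Linear SVC": (LinearSVC(), {"clf__C": [0.1, 1, 10]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Vectorizing inside the pipeline keeps each CV fold's vocabulary
    # limited to that fold's training portion.
    pipeline = Pipeline([("vec", CountVectorizer()), ("clf", estimator)])
    search = GridSearchCV(pipeline, grid, cv=2, scoring="accuracy")
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "test_accuracy": accuracy_score(y_test, search.predict(X_test)),
    }

for name, outcome in results.items():
    print(name, outcome)
```

Wrapping `CountVectorizer` and the classifier in a single `Pipeline` lets `GridSearchCV` refit the vectorizer per fold, which avoids leaking test vocabulary into training.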

## 📈 Results & Performance

### Sentiment Distribution

Initial analysis with TextBlob revealed the distribution of sentiments across the collected tweets, visualized through bar graphs and pie charts.

[Figure: Sentiment Distribution]

### Model Performance

The machine learning models achieved moderate accuracy, highlighting the challenge of classifying social media text, which is often short, informal, and context-dependent.

| Model | Accuracy | Precision (Weighted Avg) | Recall (Weighted Avg) | F1-Score (Weighted Avg) |
|---|---|---|---|---|
| Logistic Regression | 58.90% | 0.61 | 0.59 | 0.55 |
| Linear SVC | 60.00% | 0.62 | 0.60 | 0.57 |
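
For readers unfamiliar with the "Weighted Avg" columns, the sketch below shows how such figures are typically computed: each class's precision, recall, and F1 is weighted by that class's share of the true labels. The function name `weighted_metrics` and the six example labels are hypothetical and are not taken from the project's results.

```python
from collections import Counter
from typing import Dict, List


def weighted_metrics(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Support-weighted precision, recall, and F1, plus plain accuracy."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    precision = recall = f1 = 0.0
    for cls, n_cls in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        p_cls = tp / predicted if predicted else 0.0  # 0 when class is never predicted
        r_cls = tp / n_cls
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0
        weight = n_cls / total
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: a model that leans heavily towards "neu".
y_true = ["neg", "neg", "neu", "neu", "neu", "pos"]
y_pred = ["neg", "neu", "neu", "neu", "neu", "neu"]
metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```

Support-weighted recall is always equal to accuracy, which matches the table: 0.59 against 58.90% and 0.60 against 60.00%. A weighted F1 below accuracy, as in both rows, is what one would expect when a model over-predicts one class.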

Key Findings:

- The Neutral class had the highest recall (~0.86), meaning the model was best at correctly identifying neutral tweets.
- The Negative class had the lowest recall (~0.23-0.29), indicating the model often misclassified negative tweets as neutral or positive.
- The Positive class was the most challenging, showing lower precision and recall scores. Manual inspection revealed that many "positive" tweets were actually about government promises and plans to combat flooding, which were hopeful in tone but contextually related to a negative event.

### Visualization

**Word Clouds:** Generated for each sentiment class to visualize the most frequent words in Negative, Positive, and Neutral tweets.

[Figure: Word Cloud for Negative Tweets]
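
A word cloud is essentially a rendering of per-class word frequencies, so the underlying counts can be sketched as below. The helper names `word_frequencies_by_class` and `save_word_cloud` are hypothetical, the image size and background colour are arbitrary choices, and the `wordcloud` import is deferred so the counting step runs without that package.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def word_frequencies_by_class(rows: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count words per sentiment label from (cleaned_text, label) pairs."""
    freqs = defaultdict(Counter)
    for cleaned_text, label in rows:
        freqs[label].update(cleaned_text.split())
    return dict(freqs)


def save_word_cloud(frequencies: Counter, path: str) -> None:
    """Render one class's frequencies as a word cloud image file."""
    # Imported here so the counting helper works without wordcloud installed.
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)


# Made-up cleaned tweets for illustration only.
rows = [
    ("flood drain block", "Negative"),
    ("flood road", "Negative"),
    ("rain forecast", "Neutral"),
]
freqs = word_frequencies_by_class(rows)
print(freqs["Negative"].most_common(3))
```

Calling `save_word_cloud(freqs["Negative"], "negative.png")` would then write one image per class; the output filename here is only an example.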


## 🛠️ Installation & Usage

### Prerequisites
*   **Python 3.7+**
*   Required Python libraries: Install via `pip install -r requirements.txt`
    *   `pandas`
    *   `numpy`
    *   `nltk`
    *   `textblob`
    *   `scikit-learn`
    *   `matplotlib`
    *   `wordcloud`

### Running the Analysis
1.  **Clone the repository** and navigate to the project directory.
2.  **Install the required packages** (see above).
3.  **Place your raw Twitter data** in the `Data/Raw/` folder.
4.  **Run the main script:**
    ```bash
    python "Scripts/Twitter sentiment analysis for flooding in the Greater Accra Region of Ghana.py"
    ```
5.  The script will output:
    *   Cleaned data files
    *   Visualizations of sentiment distribution
    *   Performance metrics and confusion matrices for the ML models

## 🎯 Conclusions & Insights

*   **Public Discourse:** The analysis captured a mix of frustration (negative), factual reporting (neutral), and discussion of solutions (positive) related to flooding in Accra.
*   **Model Limitations:** Both models reached only about 60% accuracy. High accuracy is hard to achieve with standard ML models on social media text because of sarcasm, irony, and complex context. For example, tweets with a positive tone discussing government action were often rooted in a negative flooding event.
*   **Value of Integration:** This sentiment data provides crucial qualitative context to quantitative geospatial flood models, highlighting areas where public concern is highest and where communication efforts might be needed most.

## 🔮 Future Work

*   **Advanced NLP Techniques:** Utilize pre-trained transformer models like BERT or RoBERTa for more context-aware sentiment classification.
*   **Aspect-Based Sentiment Analysis:** Move beyond overall sentiment to identify specific aspects people are talking about (e.g., sentiment on "drainage," "government response," "property damage").
*   **Temporal Analysis:** Scrape data over a longer period to analyze how sentiment shifts before, during, and after major flood events.
*   **Geolocation Integration:** Map sentiments to specific locations within Greater Accra (where possible) to create a spatial sentiment layer.
*   **Multi-Platform Analysis:** Incorporate data from other social media platforms like Facebook and Instagram for a more comprehensive view.


## 🙏 Acknowledgments

- Data sourced via [Apify](https://apify.com/) from X (Twitter).
- The `nltk` and `scikit-learn` communities for providing robust NLP and ML tools.
