This project extracts data from Twitter and the NEWS API to perform sentiment and semantic analysis. Sentiment analysis is performed on the Twitter data at Twitter/cleaned_twitter_data, and semantic analysis on the news data at News/News Articles. The Twitter data was extracted with Tweepy, and the news data was extracted using the NEWS API. The project was developed as coursework for CSCI 5408 - Data Management, Warehousing and Analytics. Report.pdf contains a summary of the project.
The Twitter folder contains scripts such as cleaning_twitter_data.py and twitter_data.py, which extract data using the API and pre-process it. Pre-processing removes metadata, URLs, and special characters. polarity_twitter_data.py counts the positive and negative words in each tweet to determine its polarity: it converts the tweet into a bag-of-words and compares the words against lists of positive and negative words. The figure below shows a word-cloud visualization, created in Tableau, of the positive and negative words occurring in the tweets.
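The cleaning and polarity steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in polarity_twitter_data.py: the word lists, the function names, and the "neutral" label for ties are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical word lists; the lists used by the project are not shown here.
POSITIVE_WORDS = {"good", "great", "happy", "love"}
NEGATIVE_WORDS = {"bad", "sad", "hate", "terrible"}


def clean_tweet(text):
    """Remove URLs and special characters, collapse whitespace, lowercase."""
    text = re.sub(r"https?://\S+", " ", text)      # drop URLs
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)    # drop special characters
    return re.sub(r"\s+", " ", text).strip().lower()


def bag_of_words(text):
    """Map each word in the cleaned text to its number of occurrences."""
    return Counter(text.split())


def polarity(text):
    """Return (positive_count, negative_count, label) for one tweet."""
    bag = bag_of_words(clean_tweet(text))
    positive = sum(n for word, n in bag.items() if word in POSITIVE_WORDS)
    negative = sum(n for word, n in bag.items() if word in NEGATIVE_WORDS)
    if positive > negative:
        label = "positive"
    elif negative > positive:
        label = "negative"
    else:
        label = "neutral"  # assumption: ties are reported as neutral
    return positive, negative, label
```

For example, `polarity("I love this great day! https://t.co/abc")` counts two positive words and no negative words, so the tweet is labelled positive.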
The News folder contains scripts such as cleaning_news_data.py and news_data.py, which extract data using the API and clean it by removing URLs. Semantic_Analysis.py reads the news data from cleaned_news_data.csv and stores it as separate files under News/News Articles. It then computes TF-IDF and the number of occurrences of the word Canada in the documents, counts the total words in each document, and calculates the relative frequency (occurrences divided by total words). relative_frequency.csv contains the frequency and relative frequency of the word Canada for every document in which it occurs. Finally, the script prints the article with the highest relative frequency, as shown in the figure below.
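The frequency calculation described above can be sketched as follows. This is an illustrative outline rather than the contents of Semantic_Analysis.py: the function names and the tokenizer are hypothetical, and the inverse document frequency uses log10(N / df), which is one common definition and may differ from the one used in the project.

```python
import math
import re


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def term_stats(documents, term):
    """Per-document frequency and relative frequency of `term`.

    Returns (rows, idf). `rows` only includes documents in which the term
    occurs; idf is log10(N / df), an assumed definition for this sketch.
    """
    term = term.lower()
    rows = []
    for index, text in enumerate(documents):
        words = tokenize(text)
        frequency = words.count(term)
        if frequency:
            rows.append({
                "document": index,
                "total_words": len(words),
                "frequency": frequency,
                "relative_frequency": frequency / len(words),
            })
    document_frequency = len(rows)
    idf = math.log10(len(documents) / document_frequency) if document_frequency else 0.0
    return rows, idf


def highest_relative_frequency(rows):
    """Return the row with the highest relative frequency, or None if empty."""
    return max(rows, key=lambda row: row["relative_frequency"]) if rows else None


# Made-up sample articles, used only to exercise the functions.
articles = [
    "Canada is cold. Canada is large.",
    "Weather in Canada today",
    "No mention here",
]
rows, idf = term_stats(articles, "Canada")
best = highest_relative_frequency(rows)
```

Here `rows` plays the role of relative_frequency.csv, and `best` identifies the article the script would print.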
