diff --git a/airline_tweet_data_analysis.md b/airline_tweet_data_analysis.md
new file mode 100644
index 0000000..87435f0
--- /dev/null
+++ b/airline_tweet_data_analysis.md
@@ -0,0 +1,1136 @@
+
+# Airline Sentiment Analysis Project
+
+Project Objective
+
+ 1. Analysing the data to visualize airline trends
+
+ 2. Classifying each tweet's sentiment as positive, neutral, or negative using machine learning techniques, then categorizing negative tweets by reason.
+
+# Data Analysis
+
+
+```python
+import pandas as pd ## for reading and understanding data
+import matplotlib.pyplot as plt ## for plotting data
+import seaborn as sns ## another library to visualize data features
+import numpy as np ## for numerical array processing
+```
+
+
+```python
+##reading data
+data=pd.read_csv('twitter-airline/Tweets.csv')
+data.head()
+```
+
+
+| | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| 0 | 570306133677760513 | neutral | 1.0000 | NaN | NaN | Virgin America | NaN | cairdin | NaN | 0 | @VirginAmerica What @dhepburn said. | NaN | 2015-02-24 11:35:52 -0800 | NaN | Eastern Time (US & Canada) |
+| 1 | 570301130888122368 | positive | 0.3486 | NaN | 0.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica plus you've added commercials t... | NaN | 2015-02-24 11:15:59 -0800 | NaN | Pacific Time (US & Canada) |
+| 2 | 570301083672813571 | neutral | 0.6837 | NaN | NaN | Virgin America | NaN | yvonnalynn | NaN | 0 | @VirginAmerica I didn't today... Must mean I n... | NaN | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
+| 3 | 570301031407624196 | negative | 1.0000 | Bad Flight | 0.7033 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica it's really aggressive to blast... | NaN | 2015-02-24 11:15:36 -0800 | NaN | Pacific Time (US & Canada) |
+| 4 | 570300817074462722 | negative | 1.0000 | Can't Tell | 1.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica and it's a really big bad thing... | NaN | 2015-02-24 11:14:45 -0800 | NaN | Pacific Time (US & Canada) |
+
+
+
+```python
+data=data[['tweet_id','text','airline_sentiment','airline_sentiment_confidence','negativereason','airline','retweet_count','tweet_created']]
+```
+
+
+```python
+data.head()
+```
+
+
+| | tweet_id | text | airline_sentiment | airline_sentiment_confidence | negativereason | airline | retweet_count | tweet_created |
+|---|---|---|---|---|---|---|---|---|
+| 0 | 570306133677760513 | @VirginAmerica What @dhepburn said. | neutral | 1.0000 | NaN | Virgin America | 0 | 2015-02-24 11:35:52 -0800 |
+| 1 | 570301130888122368 | @VirginAmerica plus you've added commercials t... | positive | 0.3486 | NaN | Virgin America | 0 | 2015-02-24 11:15:59 -0800 |
+| 2 | 570301083672813571 | @VirginAmerica I didn't today... Must mean I n... | neutral | 0.6837 | NaN | Virgin America | 0 | 2015-02-24 11:15:48 -0800 |
+| 3 | 570301031407624196 | @VirginAmerica it's really aggressive to blast... | negative | 1.0000 | Bad Flight | Virgin America | 0 | 2015-02-24 11:15:36 -0800 |
+| 4 | 570300817074462722 | @VirginAmerica and it's a really big bad thing... | negative | 1.0000 | Can't Tell | Virgin America | 0 | 2015-02-24 11:14:45 -0800 |
+
+
+
+```python
+data.info()
+```
+
+    <class 'pandas.core.frame.DataFrame'>
+    RangeIndex: 14640 entries, 0 to 14639
+    Data columns (total 8 columns):
+    tweet_id                        14640 non-null int64
+    text                            14640 non-null object
+    airline_sentiment               14640 non-null object
+    airline_sentiment_confidence    14640 non-null float64
+    negativereason                  9178 non-null object
+    airline                         14640 non-null object
+    retweet_count                   14640 non-null int64
+    tweet_created                   14640 non-null object
+    dtypes: float64(1), int64(2), object(5)
+    memory usage: 915.1+ KB
+
+
+
+```python
+sentiments=pd.crosstab(data.airline, data.airline_sentiment)
+sentiments
+```
+
+
+| airline \ airline_sentiment | negative | neutral | positive |
+|---|---|---|---|
+| American | 1960 | 463 | 336 |
+| Delta | 955 | 723 | 544 |
+| Southwest | 1186 | 664 | 570 |
+| US Airways | 2263 | 381 | 269 |
+| United | 2633 | 697 | 492 |
+| Virgin America | 181 | 171 | 152 |
+
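The crosstab above reports raw counts, and the airlines differ a lot in how many tweets they received overall. A minimal sketch of turning those counts into per-airline shares, with the numbers copied by hand from the table (the names `counts`, `shares`, and `negative_pct` are illustrative and do not appear in the notebook):

```python
import pandas as pd

# Counts copied from the crosstab above (rows: airline, columns: sentiment).
counts = pd.DataFrame(
    {
        "negative": [1960, 955, 1186, 2263, 2633, 181],
        "neutral": [463, 723, 664, 381, 697, 171],
        "positive": [336, 544, 570, 269, 492, 152],
    },
    index=["American", "Delta", "Southwest", "US Airways", "United", "Virgin America"],
)

# Divide each row by its row total so every airline's shares sum to 1.
shares = counts.div(counts.sum(axis=1), axis=0)

# Percentage of each airline's tweets that are negative, highest first.
negative_pct = (shares["negative"] * 100).round(1).sort_values(ascending=False)
print(negative_pct)
```

On the full data frame, `pd.crosstab(data.airline, data.airline_sentiment, normalize='index')` should produce the same shares directly.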
+
+
+```python
+negative_tweet=data[(data['airline_sentiment']=='negative')]
+negative_tweet.head()
+```
+
+
+| | tweet_id | text | airline_sentiment | airline_sentiment_confidence | negativereason | airline | retweet_count | tweet_created |
+|---|---|---|---|---|---|---|---|---|
+| 3 | 570301031407624196 | @VirginAmerica it's really aggressive to blast... | negative | 1.0000 | Bad Flight | Virgin America | 0 | 2015-02-24 11:15:36 -0800 |
+| 4 | 570300817074462722 | @VirginAmerica and it's a really big bad thing... | negative | 1.0000 | Can't Tell | Virgin America | 0 | 2015-02-24 11:14:45 -0800 |
+| 5 | 570300767074181121 | @VirginAmerica seriously would pay $30 a fligh... | negative | 1.0000 | Can't Tell | Virgin America | 0 | 2015-02-24 11:14:33 -0800 |
+| 15 | 570282469121007616 | @VirginAmerica SFO-PDX schedule is still MIA. | negative | 0.6842 | Late Flight | Virgin America | 0 | 2015-02-24 10:01:50 -0800 |
+| 17 | 570276917301137409 | @VirginAmerica I flew from NYC to SFO last we... | negative | 1.0000 | Bad Flight | Virgin America | 0 | 2015-02-24 09:39:46 -0800 |
+
+
+
+Negative tweet counts per airline, then the most common words in negative tweets
+
+
+```python
+negative_tweet.airline.value_counts() # count negative tweets per airline to identify the airline with the most complaints in 2015
+```
+
+
+
+
+    United            2633
+    US Airways        2263
+    American          1960
+    Southwest         1186
+    Delta              955
+    Virgin America     181
+    Name: airline, dtype: int64
+
+
+
+
+```python
+from wordcloud import WordCloud
+def plotWords(words):
+    wordcloud=WordCloud(width=1200, height=600, random_state=21,max_font_size=110).generate(words)
+    plt.figure(figsize=(10,7))
+    plt.imshow(wordcloud,interpolation="bilinear")
+    plt.axis('off')
+    plt.show()
+```
+
+
+```python
+neg_tweet_words=negative_tweet.text.values.tolist()
+neg_words=' '.join(neg_tweet_words)
+plotWords(neg_words)
+```
+
+
+![png](output_12_0.png)
+
+
+The word cloud shows which airlines are mentioned most often in negative tweets and hints at the reasons for the complaints.
+
+Let's look at positive tweets to understand which services customers are more satisfied with.
+
+
+```python
+positive_tweet=data[(data['airline_sentiment']=='positive')]
+pos_tweet_words=positive_tweet.text.values.tolist()
+pos_words=' '.join(pos_tweet_words)
+plotWords(pos_words)
+```
+
+
+![png](output_14_0.png)
+
+
+Words such as appreciate, good, thanks, really, great, amazing, best, nice, and happy dominate the positive tweets, pointing to the services customers are satisfied with.
+
+
+```python
+def plot_bar(title,x_label,y_label,data):
+    fig, ax = plt.subplots(figsize=(10, 3))
+    data.plot(kind='bar', ax=ax) ## draw on the axes created above
+    ax.tick_params(axis='x', labelsize=12)
+    ax.tick_params(axis='y', labelsize=12)
+    ax.set_xlabel(x_label, fontsize=12)
+    ax.set_ylabel(y_label, fontsize=12)
+    ax.set_title(title, fontsize=15, fontweight='bold')
+```
+
+
+```python
+reason_count=negative_tweet['negativereason'].value_counts()
+_=reason_count.plot(kind='bar')
+```
+
+
+![png](output_17_0.png)
+
+
+
+```python
+airline_neg_reason=negative_tweet.groupby('airline')['negativereason'].value_counts()
+airline_neg_reason.unstack()
+```
+
+
+| airline \ negativereason | Bad Flight | Can't Tell | Cancelled Flight | Customer Service Issue | Damaged Luggage | Flight Attendant Complaints | Flight Booking Problems | Late Flight | Lost Luggage | longlines |
+|---|---|---|---|---|---|---|---|---|---|---|
+| American | 87 | 198 | 246 | 768 | 12 | 87 | 130 | 249 | 149 | 34 |
+| Delta | 64 | 186 | 51 | 199 | 11 | 60 | 44 | 269 | 57 | 14 |
+| Southwest | 90 | 159 | 162 | 391 | 14 | 38 | 61 | 152 | 90 | 29 |
+| US Airways | 104 | 246 | 189 | 811 | 11 | 123 | 122 | 453 | 154 | 50 |
+| United | 216 | 379 | 181 | 681 | 22 | 168 | 144 | 525 | 269 | 48 |
+| Virgin America | 19 | 22 | 18 | 60 | 4 | 5 | 28 | 17 | 5 | 3 |
+
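To read the table above row by row, it can help to pull out each airline's single most frequent complaint and how much of that airline's negative tweets it accounts for. A small sketch under the assumption that the counts are copied by hand from the table (`reason_counts` and `summary` are hypothetical names, not part of the notebook):

```python
import pandas as pd

reasons = [
    "Bad Flight", "Can't Tell", "Cancelled Flight", "Customer Service Issue",
    "Damaged Luggage", "Flight Attendant Complaints", "Flight Booking Problems",
    "Late Flight", "Lost Luggage", "longlines",
]

# Counts copied from the airline-by-reason table above.
reason_counts = pd.DataFrame(
    [
        [87, 198, 246, 768, 12, 87, 130, 249, 149, 34],
        [64, 186, 51, 199, 11, 60, 44, 269, 57, 14],
        [90, 159, 162, 391, 14, 38, 61, 152, 90, 29],
        [104, 246, 189, 811, 11, 123, 122, 453, 154, 50],
        [216, 379, 181, 681, 22, 168, 144, 525, 269, 48],
        [19, 22, 18, 60, 4, 5, 28, 17, 5, 3],
    ],
    index=["American", "Delta", "Southwest", "US Airways", "United", "Virgin America"],
    columns=reasons,
)

# idxmax(axis=1) returns, for each row, the column label holding the largest value.
top_reason = reason_counts.idxmax(axis=1)

# Share of the airline's negative tweets explained by that top reason, in percent.
top_share = (reason_counts.max(axis=1) / reason_counts.sum(axis=1) * 100).round(1)

summary = pd.DataFrame({"top_reason": top_reason, "share_pct": top_share})
print(summary)
```

The row sums of `reason_counts` match the negative tweet counts per airline reported earlier, which is a quick sanity check on the copied numbers.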
+
+
+```python
+def plot_sns(x,y,data):
+    sns.set(rc={'figure.figsize':(10,10)})
+    ax=sns.countplot(y=y,hue=x,data=data)
+    for p in ax.patches:
+        ## bars are horizontal (y=...), so the count is the bar width, not its height
+        patch_width = p.get_width()
+        if np.isnan(patch_width):
+            patch_width = 0
+        ax.annotate('{}'.format(int(patch_width)), (patch_width, p.get_y()+p.get_height()/2), ha = 'left', va = 'center', xytext = (3, 0), textcoords = 'offset points')
+    plt.title("Distribution of negative reason for each airline")
+    plt.show()
+plot_sns('negativereason','airline',negative_tweet)
+```
+
+
+![png](output_19_0.png)
+
+
+The plot and table above show that United, US Airways, and American receive far more negative tweets than Delta, Virgin America, and Southwest. Apart from Delta and Virgin America, the other four airlines draw heavy customer service complaints (391 to 811 tweets each), and United and US Airways are also the most often reported late (525 and 453 tweets). These are raw tweet counts, so airlines with more tweets overall naturally show more complaints. Comparatively, Virgin America fares best, with Delta the next choice.
+
+# Is flight time related to the negative reason?
+
+We will focus on the top three airlines by negative sentiment.
+
+
+```python
+#time based analysis
+data['tweet_created']=pd.to_datetime(data['tweet_created'], utc=True).dt.tz_localize(None) ## parse the UTC offset, convert to UTC, then drop the timezone
+data.head()
+```
+
+
+| | tweet_id | text | airline_sentiment | airline_sentiment_confidence | negativereason | airline | retweet_count | tweet_created |
+|---|---|---|---|---|---|---|---|---|
+| 0 | 570306133677760513 | @VirginAmerica What @dhepburn said. | neutral | 1.0000 | NaN | Virgin America | 0 | 2015-02-24 19:35:52 |
+| 1 | 570301130888122368 | @VirginAmerica plus you've added commercials t... | positive | 0.3486 | NaN | Virgin America | 0 | 2015-02-24 19:15:59 |
+| 2 | 570301083672813571 | @VirginAmerica I didn't today... Must mean I n... | neutral | 0.6837 | NaN | Virgin America | 0 | 2015-02-24 19:15:48 |
+| 3 | 570301031407624196 | @VirginAmerica it's really aggressive to blast... | negative | 1.0000 | Bad Flight | Virgin America | 0 | 2015-02-24 19:15:36 |
+| 4 | 570300817074462722 | @VirginAmerica and it's a really big bad thing... | negative | 1.0000 | Can't Tell | Virgin America | 0 | 2015-02-24 19:14:45 |
+
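The table above shows `tweet_created` shifted from local time with a `-0800` offset to UTC (11:35:52 became 19:35:52), so any hour-of-day analysis that follows is in UTC. A minimal sketch of that conversion, assuming `pd.to_datetime(..., utc=True)` as the parser and `America/Los_Angeles` as a purely illustrative target zone; neither needs to match the notebook's exact call:

```python
import pandas as pd

# Two raw timestamps as they appear in the earlier tables (local time plus UTC offset).
raw = pd.Series(["2015-02-24 11:35:52 -0800", "2015-02-24 09:39:46 -0800"])

# utc=True reads the offset and normalizes every value to UTC.
utc_times = pd.to_datetime(raw, utc=True)

# Assumption: Pacific time is used here only as an example of converting back to a local zone.
pacific_times = utc_times.dt.tz_convert("America/Los_Angeles")

utc_hours = utc_times.dt.hour.tolist()
pacific_hours = pacific_times.dt.hour.tolist()
weekday_names = utc_times.dt.day_name().tolist()

print(utc_hours, pacific_hours, weekday_names)
```

Whether UTC or a local zone is the better basis depends on the question: UTC keeps all tweets on one clock, while local time is closer to the passenger's actual time of day.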
+
+
+```python
+data['tweet_created_date']=data.tweet_created.dt.date
+data['tweet_created_weekday_name']=data.tweet_created.dt.day_name() ## dt.weekday_name was removed in pandas 1.0
+data['tweet_created_hour']=data.tweet_created.dt.hour
+data.head()
+```
+
+
+| | tweet_id | text | airline_sentiment | airline_sentiment_confidence | negativereason | airline | retweet_count | tweet_created | tweet_created_date | tweet_created_weekday_name | tweet_created_hour |
+|---|---|---|---|---|---|---|---|---|---|---|---|
+| 0 | 570306133677760513 | @VirginAmerica What @dhepburn said. | neutral | 1.0000 | NaN | Virgin America | 0 | 2015-02-24 19:35:52 | 2015-02-24 | Tuesday | 19 |
+| 1 | 570301130888122368 | @VirginAmerica plus you've added commercials t... | positive | 0.3486 | NaN | Virgin America | 0 | 2015-02-24 19:15:59 | 2015-02-24 | Tuesday | 19 |
+| 2 | 570301083672813571 | @VirginAmerica I didn't today... Must mean I n... | neutral | 0.6837 | NaN | Virgin America | 0 | 2015-02-24 19:15:48 | 2015-02-24 | Tuesday | 19 |
+| 3 | 570301031407624196 | @VirginAmerica it's really aggressive to blast... | negative | 1.0000 | Bad Flight | Virgin America | 0 | 2015-02-24 19:15:36 | 2015-02-24 | Tuesday | 19 |
+| 4 | 570300817074462722 | @VirginAmerica and it's a really big bad thing... | negative | 1.0000 | Can't Tell | Virgin America | 0 | 2015-02-24 19:14:45 | 2015-02-24 | Tuesday | 19 |
+
+
+
+```python
+negative_tweet=data[(data['airline_sentiment']=='negative')]
+neg_by_wkday = negative_tweet.groupby(['tweet_created_weekday_name']).negativereason.value_counts()
+neg_by_wkday.unstack()
+```
+
+
+| tweet_created_weekday_name \ negativereason | Bad Flight | Can't Tell | Cancelled Flight | Customer Service Issue | Damaged Luggage | Flight Attendant Complaints | Flight Booking Problems | Late Flight | Lost Luggage | longlines |
+|---|---|---|---|---|---|---|---|---|---|---|
+| Friday | 71 | 119 | 52 | 207 | 4 | 41 | 47 | 237 | 57 | 14 |
+| Monday | 121 | 316 | 227 | 791 | 13 | 105 | 122 | 399 | 207 | 45 |
+| Saturday | 55 | 117 | 100 | 313 | 9 | 42 | 56 | 135 | 53 | 21 |
+| Sunday | 103 | 175 | 210 | 547 | 11 | 105 | 73 | 360 | 121 | 39 |
+| Thursday | 57 | 108 | 31 | 197 | 5 | 37 | 54 | 124 | 38 | 15 |
+| Tuesday | 112 | 241 | 173 | 615 | 19 | 102 | 128 | 268 | 160 | 27 |
+| Wednesday | 61 | 114 | 54 | 240 | 13 | 49 | 49 | 142 | 88 | 17 |
+
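Because `groupby` sorts its keys, the weekdays in the table above come out alphabetically (Friday, Monday, Saturday, ...), and a line plot drawn from it would connect the days in that order. A minimal sketch of putting the rows back into calendar order with `reindex`, using two columns copied from the table (`by_day`, `week_order`, and `by_day_ordered` are illustrative names):

```python
import pandas as pd

# Two columns copied from the weekday table above, in its alphabetical row order.
by_day = pd.DataFrame(
    {
        "Customer Service Issue": [207, 791, 313, 547, 197, 615, 240],
        "Late Flight": [237, 399, 135, 360, 124, 268, 142],
    },
    index=["Friday", "Monday", "Saturday", "Sunday", "Thursday", "Tuesday", "Wednesday"],
)

# Calendar order, starting the week on Monday (an arbitrary but common choice).
week_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# reindex reorders the rows to follow week_order without changing any values.
by_day_ordered = by_day.reindex(week_order)
print(by_day_ordered)
```

In the notebook, the same idea would apply to the unstacked frame before plotting, for example by calling `.reindex(week_order)` on it before `.plot(kind='line')`.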
+
+
+```python
+neg_by_wkday = neg_by_wkday.unstack().plot(kind='line',figsize=(10,5),rot=0,title="Negative Reasons by Day of Week")
+neg_by_wkday.set_xlabel("Day of Week")
+neg_by_wkday.set_ylabel("Negative Reason")
+```
+
+
+
+
+    Text(0, 0.5, 'Negative Reason')
+
+
+
+
+![png](output_24_1.png)
+
+
+The plot shows that Wednesday, Thursday, Friday, and Saturday are comparatively good. On Sunday, Monday, and Tuesday, customer service complaints and late flights are most frequent (the green line also shows that flight cancellations are more common on these three days).
+
+
+```python
+neg_by_time = negative_tweet.groupby(['tweet_created_hour']).negativereason.value_counts()
+
+neg_by_time = neg_by_time.unstack().plot(kind='line',figsize=(10, 5),title="Negative Reasons by Hour")
+neg_by_time.set_xlabel("Time")
+neg_by_time.set_ylabel("Negative Reason")
+```
+
+
+
+
+    Text(0, 0.5, 'Negative Reason')
+
+
+
+
+![png](output_26_1.png)
+
+
+The time-based analysis points to where airline service could be improved.
+
+
+Flights in the time ranges 12:00 AM to 3:00 AM and 4:00 PM to 6:00 PM, judged by when the tweets were posted (hours are in UTC after the `tweet_created` conversion above), have the highest customer dissatisfaction.
diff --git a/index.md b/index.md
index d13461d..333205a 100644
--- a/index.md
+++ b/index.md
@@ -1,16 +1,14 @@
-This repository is my data science portifolio repository that I have created through self-directed learning. The repository contains data analysis, computer vision, NLP, and A/B testing projects I have created.
+This repository is my data science portfolio repository that I have created through self-directed learning. The repository contains EDA, computer vision, NLP, and A/B testing projects I have created.
-### Data Analytics
+### Exploratory Data Analysis (EDA)
 1. Exploring Major Cities Health Indicators
-2. Airline Tweet Analysis to discover negative opinions of passengers towards service improvement
+2. Airline Tweet Analysis to discover negative opinions of passengers towards service improvement
 3. Simple Tweeter Data Analysis
-4. Dealing with unbalanced data
+4. Soccer Data Analysis
 ### Natural Language Processing
 1. Amharic Word Embedding
-2. Sentiment analysis
+2. Airlines Sentiment analysis with CNN, RNN, and BERT
 3. Amharic simplle text preprocessing
-4. Text Similarity
-5. Tranisfer Learning with Universal Encoders
 ### Computer Vision Projects
 1. Amharic Character recoginition
 2. Malaria Microscopic cell pathogenic object detection
diff --git a/output_12_0.png b/output_12_0.png
new file mode 100644
index 0000000..36048be
Binary files /dev/null and b/output_12_0.png differ
diff --git a/output_14_0.png b/output_14_0.png
new file mode 100644
index 0000000..622db35
Binary files /dev/null and b/output_14_0.png differ
diff --git a/output_17_0.png b/output_17_0.png
new file mode 100644
index 0000000..7c84344
Binary files /dev/null and b/output_17_0.png differ
diff --git a/output_19_0.png b/output_19_0.png
new file mode 100644
index 0000000..744e887
Binary files /dev/null and b/output_19_0.png differ
diff --git a/output_24_1.png b/output_24_1.png
new file mode 100644
index 0000000..cc8908c
Binary files /dev/null and b/output_24_1.png differ
diff --git a/output_26_1.png b/output_26_1.png
new file mode 100644
index 0000000..cb253da
Binary files /dev/null and b/output_26_1.png differ