\section{Introduction}
A programmer's mood affects their activities and performance as programming involves various forms of cognitive tasks~\cite{khan2011moods}.
The collaborative nature of today's software development
means that a developer's mood can be affected by other
developers, and conversely, collaborating developers can be
affected by one developer's
mood~\cite{murgia2014developers,graziotin2014happy,curtis1988field}.
Studies in other domains also show that organizational events can cause affective reactions which in turn influence performance and job satisfaction~\cite{parkinson1996changing}.
Therefore, software engineering researchers have increasingly studied emotion in software engineering in recent years~\cite{jongeling2017negative}.
Recent studies involve
mining software artifacts,
understanding signals of human emotions hidden in those artifacts, and
then analyzing the signals in automated ways.
The findings of these studies could be used by teams to
track mood and formulate new strategies to ensure
a healthy environment.
Researchers often use natural language processing tools to capture the emotion in a team.
One place where teams express emotion is on social collaborative sites
like GitHub, where
developers communicate with each other to maintain their projects~\cite{storey2010impact}.
When developers use these sites,
researchers can analyze the detailed history of project development,
as well as developers' communication about the project in the form of issues, for example.
Such textual artifacts present the opportunity
to use natural language processing tools
to do different affect analyses
for different purposes~\cite{ortu2015bullies,ebert2017confusion,gachechiladze2017anger}.
While analyzing sentiment is the most commonly used technique to measure emotions,
researchers have also used natural language tools to measure
politeness~\cite{ortu2015bullies}, which is an important
factor in the on-boarding process~\cite{steinmacher2015social}.
Thus, in this paper, we focus on sentiment and
politeness analysis tools.
Despite the use of automated tools for analyzing sentiment and
politeness, significant questions remain about the tools' reliability.
Researchers who focus on sentiment analysis have provided strong data
that the tools are less effective when applied to new domains they
were not trained on~\cite{novielli2015challenges,gamon2005pulse}.
In our domain, Jongeling and colleagues have
shown that different sentiment analysis tools yield different
results when used on data from a JIRA repository~\cite{jongeling2017negative}.
Hence, understanding the reliability of these tools will help
software engineering researchers know whether they can use these
tools to reach conclusions confidently.
In this paper, we study the reliability of popular sentiment analysis
and politeness tools in the context of developer discussions.
To do so, we randomly chose 589 comments from pull requests and issues on GitHub,
manually annotated them for sentiment and politeness using human coders,
and compared those annotations against the results produced by the tools.
The major contributions of this paper are:
\begin{enumerate}
\item A benchmark of 589 GitHub comments,
hand-rated by humans for sentiment and politeness,
\item An annotation scheme that can be used to rate
the politeness of comments in discussions, and
\item A reliability evaluation of six sentiment analysis
tools and one politeness tool.
\end{enumerate}
The dataset, coding schemes, and other associated files have been made publicly available at \url{https://github.com/DeveloperLiberationFront/AffectAnalysisToolEvaluation}.
\section{Related Work}\label{relwork}
\subsection{Sentiment Analysis in Software Engineering}\label{rwsent}
Sentiment analysis is the task of classifying texts
according to their polarity
(positive, negative, or neutral)~\cite{pang2008opinion}.
Using data from GitHub,
Guzman and colleagues applied sentiment analysis
to study commit comments~\cite{guzman2014sentiment},
while Pletea and colleagues tried to find a correlation between security-related discussions and fluctuating sentiments~\cite{pletea2014security}.
Other platforms,
including the Gentoo community and Stack Overflow,
have also been studied to understand developers'
sentiments~\cite{garcia2013role,islam2016towards,guzman2013towards,novielli2014towards}.
While most of these works have used general-purpose tools,
research on specialized tools for the software engineering domain
is ongoing.
These customized tools try to overcome the problems
of the general-purpose tools,
which were trained on texts
from unrelated domains,
by building their own sentiment oracles
for the software engineering domain~\cite{ahmed2017senticr,calefato2017sentiment}.
However, the human raters who annotated these oracles
used different coding schemes, or none at all,
which may have led to subtle differences in their understanding of sentiment in this domain.
\subsection{Politeness in Software Engineering}
Politeness can be described as
``the practical application of good manners or etiquette''~\cite{wiki:pol}.
Politeness is a strong factor in social collaborations~\cite{ortu2015would,wang2008politeness}.
Research in the software engineering domain
is also studying the impact of politeness expressed by developers.
For example, Ortu and colleagues concluded that
``the more polite developers were,
the less time it took to fix an issue''~\cite{ortu2015would},
while Tsay and colleagues studied GitHub discussions
and found that
``the level of a submitter's prior interaction on a project changed how politely developers discussed the contribution and the nature of proposed alternative solutions''~\cite{tsay2014let}. However, similar to sentiment analysis, different studies used different methods to rate politeness~\cite{tsay2014let,brownsoftware}.
\subsection{Challenges for Sentiment \& Politeness Analysis in Software Engineering}
Researchers are increasingly interested in studying
emotional awareness in collaborative software engineering~\cite{dewan2015towards}.
However, challenges lie in correctly identifying any type of affect before we can use the data for further analysis.
A recent work explores the possibilities of correctly judging emotions from texts extracted from software engineering platforms~\cite{murgia2014developers}.
The authors found
poor to moderate agreement between human raters
on different types of emotions.
They conclude that
``more investigation is needed before
building a fully automatic emotion mining tool.''
Besides, Novielli and colleagues point towards
the domain dependency of existing tools
which makes it harder to apply them in new corpora~\cite{novielli2015challenges}.
Jongeling and his colleagues have shown
how different tools lead to different results of sentiment
in a new data set of JIRA issue reports~\cite{jongeling2017negative}. They also show that ``this disagreement [between different tools]
can lead to diverging conclusions
and that previously published results cannot be replicated
when different sentiment analysis tools are used.''
Another work also used existing tools
on code review comments from Gerrit
and found a poor performance by the tools~\cite{ahmed2017senticr}.
Lin and colleagues report similar results~\cite{lin2018sentiment}.
This establishes that a tool's reliability should be assessed
before the tool is used for further analysis.
Just as Jongeling and colleagues
evaluated the reliability of four sentiment analysis tools,
we also perform a similar evaluation
over a new dataset of GitHub comments
with the addition of two tools
that are specifically built for the software engineering domain. Moreover, we also extend our evaluation
towards the reliability of politeness analysis
on GitHub comments.
\section{Methodology}
Our goal is to evaluate the reliability of sentiment and politeness analysis tools in developer discussions by examining the tools' performance over GitHub comments.
Before we evaluate the reliability of tools for doing affect analysis,
we first need to define our gold standard for evaluation.
Like other works~\cite{calefato2017sentiment,ahmed2017senticr}, we use human coders to create
this gold standard.
However, before taking human ratings at face value, we first ask
to what extent human coders agree with each other:
\begin{itemize}
\item\textbf{RQ1.} How consistently do \emph{human coders} rate sentiment and politeness on GitHub comments?
\end{itemize}
\noindent
Our next research questions evaluate the tools:
\begin{itemize}
\item\textbf{RQ2.} How reliable are \emph{sentiment analysis tools} on GitHub comments?
\item\textbf{RQ3.} How reliable are \emph{politeness analysis tools} on GitHub comments?
\end{itemize}
\subsection{Data Collection \& Manual Raters}\label{data}
We chose GitHub as our research setting
because it is the largest code host
in the world~\cite{gousios2014lean}.
The GitHub documentation designates
issue and pull request sections
as the appropriate place for
both general and specific project discussion
between developers.\footnote{\url{https://guides.github.com/features/issues/}}
Hence, we chose comments from these discussion threads
to evaluate our tools on.
To get the comments,
we use the GHTorrent project.\footnote{\url{http://ghtorrent.org/}}
However, while GHTorrent does archive code-review comments,
it does not contain the general discussion comments
under issues and pull requests.
Furthermore,
GitHub comments may contain code snippets.
To remove those code snippets,
we need to be able to detect the HTML tags around the code,
assuming that the authors highlighted them.
For these purposes,
we augment our data by mining GitHub's web pages
to acquire the comments from the issue and pull request review sections.
This way, we can also get the HTML tags.
As manual annotation is a time-consuming task,
we estimated that one person would need
one hour to annotate
every hundred comments.
We assigned two human coders to
manually annotate each comment,
as a previous study found that
``having more than two raters
does not change
the agreement significantly''~\cite{murgia2014developers}.
We estimated that annotating 600 comments
could be completed with reasonable time and effort.
Adding 40 more comments as a cautionary step,
we randomly picked 640 comments
from all the public projects in the data we collected.
To estimate how well these comments
represent developer discussions,
we randomly picked
20 comments out of these 640
and manually investigated their origins.
All 20 comments came from different projects
and the projects appear to represent valid software development.
18 out of these 20 projects have a clear description
of the project in their README files.
Out of these 20 projects,
4 projects had only 1 contributor,
3 projects had 2 contributors,
and 1 project had 4 contributors.
The remaining 12 projects had at least 15 contributors.
During pre-inspection,
we had to remove 51 comments as they were
either incomplete fragments,
duplicates,
code snippets without commentary,
non-English statements,
or parts of discussions that are no longer available
(due to, for example, invalid URLs).
The rationale behind the last criterion is that
if the coders wanted to see the whole discussion
a comment belongs to,
in order to understand
its underlying context,
they would be unable to do so.
We provided the final 589 comments to the human coders,
removing only the HTML tags.
However, before feeding these comments to the tools,
we also removed the code snippets and URLs,
as the tools would
incorrectly process them as English text
and hence bias the results.
The raters were also provided with the URL of the webpage
containing the comment.
Also, while we kept emoticons in the comments,
we had to strip out the emojis
before providing the comments
to the human coders and the tools.
The emojis in GitHub comments
are stored as images
that are fetched from a central hosting location.
While we could have replaced the emojis
with their titles using the images' HTML tags,
the tools would have been unable to detect them,
as they were not trained on such tokens.
So we removed all the emojis
from the comments in our test data.
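As a concrete illustration of this cleanup, the preprocessing can be sketched in Python. This is a minimal sketch, not the exact script we used; the function name and tag patterns are ours for illustration:

```python
import re

def preprocess_for_tools(html_comment: str) -> str:
    """Strip code snippets, URLs, and emoji images from a GitHub
    comment's rendered HTML before passing it to an affect-analysis
    tool. A minimal sketch of the cleanup described in the text."""
    text = html_comment
    # Drop code blocks first (<pre>...</pre>, <code>...</code>),
    # so their contents are removed rather than kept as plain text.
    text = re.sub(r"<pre.*?</pre>", " ", text, flags=re.S)
    text = re.sub(r"<code.*?</code>", " ", text, flags=re.S)
    # Drop emoji images (GitHub serves emojis as <img> tags).
    text = re.sub(r"<img[^>]*>", " ", text)
    # Remove any remaining HTML tags, keeping the enclosed text.
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove bare URLs.
    text = re.sub(r"https?://\S+", " ", text)
    # Collapse whitespace.
    return re.sub(r"\s+", " ", text).strip()
```

For the human coders, only the last two tag-stripping steps would apply, since they received the comments with code snippets and URLs intact.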
While the first coder (Coder 1) rated all the comments,
we had two other coders
who went through half of the data set each individually.
All the coders were graduate and undergraduate students
of the Computer Science department.
We provided the raters with a coding guideline,
explained in Sections~\ref{sentscheme} and~\ref{polscheme},
before giving them the comments.
The human coders had a short practice set
consisting of 10 sample comments
before starting with the main data set.
After the first iteration,
where everybody separately annotated the texts,
Coder 1 sat with each of the other coders
to discuss the comments they disagreed on
and their rationale behind their rating.
This resulted in reaching consensus
for the initially disputed comments
and bringing out insights over our coding guidelines.
\subsection{Sentiment Annotation Scheme}\label{sentscheme}
As mentioned in Section~\ref{rwsent},
previous work has used different or no coding schemes
while doing sentiment rating in the software engineering domain.
To avoid deviating far from previous works,
while still giving our coders a basic set of strategies,
we use the simple annotation scheme
developed by Mohammad in prior work~\cite{mohammad2016practical}.
Based on this scheme,
the raters were asked to label each comment as
either positive, negative, neutral, mixed or sarcasm.
The ``mixed'' label here stands for a text containing both types of emotions. Our tools label texts as ``neutral''
when opposite emotions counter each other.
From that perspective,
we eventually classified as ``neutral''
the texts that were given a ``mixed'' rating by the human raters.
We omitted the texts rated ``sarcasm'' by the human raters
from our test data, as there is no clear guideline
on how to map that label onto
the ratings the tools produce as output.
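The mapping above can be summarized in a few lines of Python (a sketch; the function name and dict representation are ours for illustration):

```python
def normalize_sentiment_labels(ratings):
    """Map human sentiment ratings onto the three labels the tools
    emit: 'mixed' collapses to 'neutral', 'sarcasm' is dropped.
    `ratings` maps a comment id to its human-assigned label."""
    normalized = {}
    for comment_id, label in ratings.items():
        if label == "sarcasm":
            continue           # no tool output corresponds to sarcasm
        if label == "mixed":
            label = "neutral"  # opposing emotions counter each other
        normalized[comment_id] = label
    return normalized
```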
\subsection{Politeness Annotation Scheme}\label{polscheme}
We could not find any commonly used politeness scheme
to rate textual documents.
So, we developed an experimental coding scheme\footnote{\url{https://tinyurl.com/ycts7vhr}}
based on
Brown and Levinson's politeness theory~\cite{brown1987politeness}
and Culpeper's work on impoliteness~\cite{culpeper1996towards}.
From these theories,
we selected the strategies that are relevant to written communication (filtering out the non-verbal cues of politeness).
The strategies depend on the theory of ``face'',
where positive face refers to the desire
to be liked, appreciated, or approved of,
and negative face is the desire
not to be imposed upon, intruded on, or otherwise put upon.
\FloatBarrier
\begin{table}[H]
\centering
\caption{Coding Strategies for Politeness}
\label{poltable}
% \begin{threeparttable}
\begin{tabular}{ | m{1.75cm} | m{2cm}|| m{2cm} | m{1.75cm} | }
\hline
\multicolumn{2}{|c||}{Politeness} & \multicolumn{2}{c|}{Impoliteness} \\
\hline
Strategies & Examples & Strategies & Examples \\
\hline
Positive Politeness: Helping hearer's positive face & Gratitude: ``I really appreciate it'' & Positive Impoliteness: threatening hearer's positive face & Seeking disagreement: ``I don't agree with this style of coding'' \\
\hline
Negative Politeness: Helping hearer's negative face & Use of Verbal Hedges: ``I suggest we write it this way'' & Negative Impoliteness: threatening hearer's negative face & Direct accusation: ``You made these faulty changes''\\
\hline
Indirect Politeness: Offering advice through indirect implication & ``It's really cold here'' could imply a request to shut the window & Sarcasm or Mock Politeness & Having a sarcastic tone: ``static??? really???'' \\
\hline
\multicolumn{2}{c|}{--} & Withhold Politeness:\tablefootnote{Excluded in the modified scheme after the experiment} Absence of politeness where it is expected & Depends on context: Failing to thank somebody after help\\
\cline{3-4}
% \begin{tablenotes}
% % \item[1] Excluded in the modified scheme after the experiment
% \end{tablenotes}
\end{tabular}
% \end{threeparttable}
%\footnotetext{Excluded in the modified scheme after the experiment}
\end{table}
However, since this was an experimental coding scheme,
the raters were given the web page
of the discussion thread containing the comment
and were encouraged to dive
into the details of the context of the whole conversation
and use their own judgment while rating.
Additionally, they were asked to give remarks
if they found any criteria
that were not covered by the scheme
or a strategy in the scheme that
did not align with the context in which the comment appeared.
The major criteria of our initial coding scheme
are shown in Table~\ref{poltable}, along with examples.
Based on this scheme,
the coders were asked to rate each text as
very polite, polite, neutral, impolite, or very impolite.
To measure the degree of politeness/impoliteness,
the coders were advised to consider
the frequency of strategies used in the comment
according to the scheme,
alongside their own judgment and the underlying context.
As mentioned in Section~\ref{data},
after the first iteration of rating,
Coder 1 had a discussion with the other coders
about the judging criteria
and reached a conclusion about the disputed comments.
Following the discussion,
we also modified our coding scheme and
propose a complete, final version,
which is presented in Section~\ref{finalpolscheme}.
From the discussion,
the coders also agreed that the initial coding scheme
was not very clear on how to distinguish
between the degrees of politeness/impoliteness,
hence we got varying judgments on them.
For the purpose of this study,
we, therefore, merged
"very polite" and "polite" into one single "polite" group,
and "very impolite" and "impolite" into an "impolite" group.
\subsection{Tool Selection}
Sentiment analysis is an active field of research
and also attracts commercial interest.
Hence, many tools are available.
Jongeling and colleagues point out four tools
(SentiStrength, NLTK, Alchemy, Stanford NLP)
as the most commonly used for sentiment analysis
across all domains,
including software engineering research~\cite{jongeling2017negative}.
These tools, however,
are trained on corpora
that are not related to software engineering.
Therefore, we add two other tools that have been trained on datasets relevant to developer discussions (Senti4SD, SentiCR).
\newline
\indent \textbf{SentiStrength:} Thelwall and colleagues' SentiStrength~\cite{thelwall2010sentiment}
was developed based on MySpace comments
and has a ``word strength list''
at the core of its algorithm.
The tool has been frequently used in software engineering research~\cite{garcia2013role,guzman2014sentiment,novielli2015challenges,guzman2013towards,sinha2016analyzing}.
SentiStrength assigns
an integer value from 1 to 5 for positive sentiment
and from $-1$ to $-5$ for negative sentiment.
We add both ratings for a text,
and label its sentiment ``positive'' if the sum is greater than zero, ``negative'' if less than zero,
and ``neutral'' otherwise.
This interpretation was used in Jongeling's work~\cite{jongeling2017negative}.
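The mapping from SentiStrength's two scores to a single label can be sketched as follows (the function name is ours):

```python
def sentistrength_label(pos_score: int, neg_score: int) -> str:
    """Collapse SentiStrength's two scores (positive 1..5,
    negative -1..-5) into one label by summing them, following
    the interpretation used in Jongeling et al.'s work."""
    total = pos_score + neg_score
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"
```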
\newline
\indent\textbf{Alchemy:} IBM's Alchemy
provides a text-processing API
which returns a label for sentiment
(positive/neutral/negative).
\newline
\indent\textbf{NLTK:} Bird and colleagues' NLTK~\cite{bird2009natural},
which uses multiple corpora in its development,
has also been used in previous works
in the software engineering domain~\cite{pletea2014security,rousinopoulos2014sentiment}.
We use an API provided at \url{http://www.text-processing.com}
to access this tool.
For each text, it also returns a sentiment label
(positive/neutral/negative).
\newline
\indent\textbf{Stanford NLP:} Socher and colleagues' Stanford NLP
divides the text into sentences,
and performs a more advanced grammatical evaluation
on each sentence
by generating a sentiment treebank
through Recursive Neural Tensor Networks~\cite{socher2013recursive}.
It was trained on movie review excerpts
from the \url{rottentomatoes.com} website.
It returns an integer score for each sentence
(0 for neutral, 1 for positive, -1 for negative)
indicating its sentiment label.
We label a text ``positive'' or ``negative''
based on whichever category has the greater number of sentences,
and ``neutral'' otherwise.
This approach is similar to that of SentiStrength.
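A sketch of this per-sentence aggregation (function name ours; ties and all-neutral texts fall through to ``neutral''):

```python
def stanford_label(sentence_scores) -> str:
    """Aggregate Stanford NLP's per-sentence scores
    (1 positive, 0 neutral, -1 negative) into one label:
    whichever polar class has more sentences wins;
    ties and all-neutral texts are 'neutral'."""
    positives = sum(1 for s in sentence_scores if s > 0)
    negatives = sum(1 for s in sentence_scores if s < 0)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"
```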
\newline
\indent\textbf{Senti4SD:} This tool,
developed by Calefato and colleagues~\cite{calefato2017sentiment}, classifies each text by labeling sentiment
as positive, negative or neutral.
The tool was built using
questions, answers and comments
from StackOverflow as its training data.
\newline
\indent \textbf{SentiCR:} Ahmed and colleagues' SentiCR~\cite{ahmed2017senticr}
is trained, using supervised learning techniques, on a dataset
that comprises
code review comments from Gerrit and
two other datasets developed
in prior work~\cite{calefato2017sentiment,ortu2016emotional}.
Based on this oracle,
the tool also returns a sentiment label
for each text
(positive, neutral, or negative).
\newline
\newline
For politeness analysis, we use a tool developed by Danescu-Nicu\-lescu-Mizil and colleagues~\cite{danescu2013computational}.
While the tool was trained on short texts
that represent requests/questions,
the tool identifies general patterns of politeness
in written texts and
has been used in recent software engineering studies.
Ortu and colleagues have used this tool
over a dataset collected from
the Apache Software Foundation Issue Tracking system,
a JIRA repository~\cite{ortu2015would,ortu2015bullies}.
They also show that sentiment and politeness
are independent metrics having a weak correlation~\cite{ortu2015bullies}.
As the tool does not have any distinct name,
we will simply call it the ``politeness tool''.
\textbf{Politeness tool:} This tool~\cite{danescu2013computational} tries to measure politeness based on
``domain-independent lexical and syntactic features
operationalizing key components of politeness theory,
such as indirection, deference, impersonalization and modality''.
It was trained and tested
on different corpora (Wikipedia and Stack Exchange)
and hence has the claim of being domain independent.
The tool ``predicts a politeness score between 0 to 1''
for each text.
\section{Results}
\subsection{RQ1}\label{gt}
In the first iteration,
we had each comment rated by two coders individually
on both sentiment and politeness.
If the two coders agreed on the rating for a comment,
we finalized that label for the comment.
This resulted in 374 comments for politeness
and 285 comments for sentiment.
We call this first set of data the ``agreed'' set.
In the second iteration,
Coder 1 had a discussion
with each of the two other human coders.
From this discussion,
a conclusive rating was achieved
for each of the initially disputed comments.
This resulted in the "final" data set
containing the newly negotiated ones along with the agreed ones.
This procedure aligns with a previous paper~\cite{ahmed2017senticr}. Also, during this discussion,
some insights about how the comments should be rated,
both on ``politeness'' and ``sentiment'', were noted.
These findings have been presented in Sections \ref{finalpolscheme} and \ref{sentschemedis}.
We used both the initially agreed and final set of data
to evaluate our tools for RQ2 and RQ3.
The amount of test data is comparable
with that of a previous study,
which used 265 labeled texts to
evaluate sentiment analysis tools~\cite{jongeling2017negative}.
Table \ref{interrater} shows the inter-rater agreement
during manual rating.
We used Weighted Cohen's Kappa
because our classes in both of the ratings
have an implicit ordering.
When two coders labeled different polarity
for their ratings
(positive vs negative and polite vs impolite),
we counted it as a strong disagreement (weight=2).
And when one of the labels was neutral
and the other was a polar one,
we counted it as a weak disagreement (weight=1).
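With polar labels encoded ordinally (negative/impolite $=-1$, neutral $=0$, positive/polite $=1$), the weights described above are simply the absolute distance between codes. A from-scratch sketch of the weighted kappa computation (names ours, not our actual analysis script):

```python
from collections import Counter

def weighted_kappa(rater1, rater2, labels, weight):
    """Weighted Cohen's kappa for two raters. `weight(a, b)` is the
    disagreement weight (0 when a == b): here, 2 for a polar clash
    (positive vs negative) and 1 for neutral vs a polar label."""
    n = len(rater1)
    # Observed weighted disagreement.
    observed = sum(weight(a, b) for a, b in zip(rater1, rater2)) / n
    # Chance-expected weighted disagreement from the marginals.
    m1, m2 = Counter(rater1), Counter(rater2)
    expected = sum(
        weight(a, b) * m1[a] * m2[b] for a in labels for b in labels
    ) / (n * n)
    return 1.0 - observed / expected

# Ordinal encoding makes the paper's weights an absolute distance.
ORDER = {"negative": -1, "impolite": -1, "neutral": 0,
         "positive": 1, "polite": 1}

def affect_weight(a, b):
    return abs(ORDER[a] - ORDER[b])  # 2 polar clash, 1 vs neutral
```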
Coder 1 had fair and moderate agreement
with one coder (Coder 2)
for sentiment and politeness, respectively, and
fair agreement with the other coder (Coder 3) for both ratings
(according to the magnitude guidelines presented by Landis and Koch~\cite{landis1977measurement}).
This agreement rate is similar to what
Murgia and colleagues found in their work
for emotions like love, sadness, fear, and joy~\cite{murgia2014developers} and
points towards the subjectivity of these affects.
However,
Coder 1's agreement with both the other coders
on politeness
is higher than on sentiment.
One reason behind this can be
that the annotation scheme for politeness
provided to the coders
was more detailed with relevant examples
than the simple scheme for sentiment.
\vspace{3mm}
\noindent\fbox{%
\parbox{\linewidth}{%
\textbf{The $\kappa=.27$ to $.48$ agreement suggests that even human ratings had low sentiment and politeness consistency on GitHub comments.}
}%
}
\vspace{3mm}
\begin{table}
\centering
\caption{Inter-rater Agreement (Weighted Cohen's Kappa)}
\label{interrater}
\begin{tabular}{|c|c|l|}
\toprule
Weighted $\kappa$ & Coder 1 and Coder 2 & Coder 1 and Coder 3\\
\midrule
Sentiment & .38 & .27 \\
\hline
Politeness & .48 & .36\\
\bottomrule
\end{tabular}
\end{table}
While there was a substantial amount of disagreement
between the coders in the first iteration,
revealing the inconsistency of human ratings,
the coders afterwards resolved the disputed comments
through discussion.
This gave us a complete data set of human-rated comments,
which we use as a benchmark for our evaluation in RQ2 and RQ3.
\subsection{RQ2}
The initially agreed dataset contained
66 positive, 20 negative, 198 neutral, and 1 sarcastic comment
out of 285 comments in total.
The final dataset contained
93 positive, 73 negative, 419 neutral, and 4 sarcastic comments
out of 589.
As our dataset is heavily biased toward ``neutral'',
F-measures alone would be a misleading metric,
since agreement by chance would be high.
We therefore also calculated the weighted Cohen's kappa, as before,
to compare the tools' output with the final human ratings.
Every tool produced an output for each comment except Alchemy,
which failed to give an output for 46 and 70 comments
of the agreed and final data sets, respectively.
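To see why chance agreement makes plain accuracy (and, by extension, per-class F-measures alone) misleading on a neutral-heavy data set, consider a degenerate ``tool'' that labels every comment neutral. Using the class counts of our final data set, such a tool looks reasonably accurate, yet its kappa is exactly zero. A minimal sketch (the always-neutral tool is hypothetical, for illustration only):

```python
# Toy illustration of the chance-agreement problem on a neutral-heavy
# set: a "tool" that labels everything neutral scores high accuracy,
# yet Cohen's kappa is 0 because it never beats chance.
from collections import Counter

def cohen_kappa(y_true, y_pred):
    n = len(y_true)
    # observed agreement
    po = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # expected agreement by chance, from the two label distributions
    ct, cp = Counter(y_true), Counter(y_pred)
    pe = sum((ct[c] / n) * (cp[c] / n) for c in set(ct) | set(cp))
    return (po - pe) / (1 - pe) if pe != 1 else 1.0

# class counts of our final data set (589 comments)
gold = (["neutral"] * 419 + ["positive"] * 93
        + ["negative"] * 73 + ["sarcasm"] * 4)
pred = ["neutral"] * len(gold)          # degenerate majority-class "tool"

accuracy = sum(t == p for t, p in zip(gold, pred)) / len(gold)
print(round(accuracy, 2))               # 0.71: looks decent
print(cohen_kappa(gold, pred))          # 0.0: no better than chance
```

This is why we report kappa alongside the per-class F-measures.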
For the first set of agreed data, in Table \ref{sentfirst},
Senti4SD had the best performance among the tools.
One point to note is that
all the tools show relatively low F-measures, with low recall,
for negative comments
compared to neutral and positive comments.
This confirms the ``negative bias'' of the existing tools,
that is, the misclassification of neutral technical texts
as emotionally negative~\cite{blaz2016sentiment,novielli2015challenges,calefato2017sentiment}.
In Table \ref{sentfinal}, we show the results
of comparing the tools against the final dataset.
Here we see a major drop in the performance
of all the tools except SentiCR,
with negative comments still faring worst.
The performance drop may indicate
that the initially disputed comments
are harder to rate.
%and there remain subtle aspects of subjectivity in these comments.
However, the tools' performance
over both datasets
points to their overall unreliability.
\vspace{3mm}
\noindent\fbox{%
\parbox{\linewidth}{%
\textbf{The $\kappa=.16$ to $.33$ agreement suggests that tools had low sentiment reliability on GitHub comments.}
}%
}
\begin{table}
\centering
\caption{Performance of Sentiment Analysis Tools for the First Set of Agreed Data}
\label{sentfirst}
\begin{tabular}{|c|c|c|c|c|}
\hline
\multicolumn{2}{|c|}{ } & \multicolumn{3}{c|}{ F-measure } \\
\hline
Tools & Weighted $\kappa$ & neutral & positive & negative \\
\hline
SentiStrength & 0.44 & 74.58\% & 61.54\% & 40.68\% \\
\hline
NLTK & 0.27 & 52.63\% & 55.42\% & 23.73\% \\
\hline
Alchemy & 0.33 & 58.78\% & 58.21\% & 32.65\% \\
\hline
Stanford NLP & 0.26 & 55.85\% & 59.83\% & 19.44\% \\
\hline
Senti4SD & 0.53 & 84.92\% & 68.18\% & 41.03\% \\
\hline
SentiCR & 0.13 & 57.04\% & 29.70\% & 29.91\% \\
\hline
\end{tabular}
\end{table}
\begin{table}
\centering
\caption{Performance of Sentiment Analysis Tools for the Final Set of Data}
\label{sentfinal}
\begin{tabular}{|c|c|c|c|c|}
\hline
\multicolumn{2}{|c|}{ } & \multicolumn{3}{c|}{ F-measure } \\
\hline
Tools & Weighted $\kappa$ & neutral & positive & negative \\
\hline
SentiStrength & 0.24 & 65.68\% & 43.01\% & 30.85\% \\
\hline
NLTK & 0.24 & 45.38\% & 45.26\% & 33.86\% \\
\hline
Alchemy & 0.23 & 51.56\% & 41.67\% & 34.48\% \\
\hline
Stanford NLP & 0.16 & 43.05\% & 45.09\% & 26.41\% \\
\hline
Senti4SD & 0.33 & 79.04\% & 50.47\% & 31.25\% \\
\hline
SentiCR & 0.24 & 64.87\% & 43.48\% & 31.02\% \\
\hline
\end{tabular}
\end{table}
\subsection{RQ3}
For politeness,
the initially agreed dataset contained
153 polite, 21 impolite, and 200 neutral comments
out of 374,
while the final dataset contained
221 polite, 46 impolite, and 322 neutral comments
out of 589.
As the politeness tool
assigns each text a politeness score between 0 and 1,
we first had to select thresholds
that separate the polite, impolite, and neutral comments.
Over the first data set, we calculated F-measures
at every .01 interval within the 0--1 range.
The highest F-measure
was 69.61\% for polite at a threshold of .57
and 77.05\% for neutral at a threshold of .65,
so, intuitively, the ideal threshold
separating polite from neutral
falls somewhere between these two values.
Our results align with the findings of Jongeling and colleagues,
who placed this threshold at .611.
We therefore use this threshold and
rate as ``polite'' every comment
for which the tool gives a score above .611.
However, the results
for impolite comments were very poor:
F-measures at all thresholds were below 20\%.
We concluded that the tool has low reliability
in deciding whether a comment in our data set is impolite,
and therefore merged ``neutral'' and ``impolite''
into a single ``non-polite'' class.
Eventually, then, we use the tool
for a binary classification of the texts
as polite or non-polite.
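The resulting binary decision and the threshold sweep can be sketched as follows; the scores and gold labels here are hypothetical examples, not our study data, and the grid mirrors the .01-interval search described above:

```python
# Minimal sketch: binarize the politeness tool's 0-1 score as
# polite / non-polite, and sweep thresholds at .01 steps for the
# best "polite" F-measure. `scores` and `gold` are hypothetical.

def f1_polite(scores, gold, threshold):
    pred = ["polite" if s > threshold else "non-polite" for s in scores]
    tp = sum(p == "polite" and g == "polite" for p, g in zip(pred, gold))
    fp = sum(p == "polite" and g == "non-polite" for p, g in zip(pred, gold))
    fn = sum(p == "non-polite" and g == "polite" for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, gold):
    # grid of 0.00, 0.01, ..., 1.00, mirroring the .01-interval sweep
    grid = [i / 100 for i in range(101)]
    return max(grid, key=lambda t: f1_polite(scores, gold, t))

# hypothetical tool scores and human labels
scores = [0.90, 0.72, 0.63, 0.40, 0.15]
gold = ["polite", "polite", "polite", "non-polite", "non-polite"]
print(f1_polite(scores, gold, 0.611))  # 1.0 on this toy data
```

On real data, `best_threshold` would be computed on one (agreed) set and then fixed before evaluating on the other, as we did with the .611 cutoff.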
We use both the agreed and the final data set
to evaluate this single tool.
As before,
we use Cohen's kappa (unweighted, as the classification is binary)
and F-measures as our metrics. Table \ref{polresult} shows that this tool also
has only ``moderate'' agreement over the first agreed data set
and ``fair'' agreement over the final data set.
\vspace{3mm}
\noindent\fbox{%
\parbox{\linewidth}{%
\textbf{The $\kappa=.39$ agreement suggests that the tool had low politeness reliability on GitHub comments.}
}%
}
\begin{table}
\centering
\caption{Performance of Politeness Analysis tool}
\label{polresult}
\begin{tabular}{|c|c|c|c|c|c|}
\hline
\multicolumn{3}{|c|}{ First Data set }& \multicolumn{3}{c|}{ Final Data set } \\
\hline
& \multicolumn{2}{c|}{ F-measure } & & \multicolumn{2}{c|}{ F-measure } \\
\hline
$\kappa$ & polite & neutral & $\kappa$ & polite & neutral \\
\hline
.49 & 67.15\% & 75.94\% & .39 & 60.24\% & 78.37\%\\
\hline
\end{tabular}
\end{table}
\section{Discussion}
\subsection{Final Politeness Coding Scheme} \label{finalpolscheme}
The coders discussed and then changed
the initial politeness coding scheme.
We differentiated between implicit politeness
and \textbf{explicit markers} of politeness
within a text.
For example, one coder rated a comment polite
if they felt the commenter was giving a detailed explanation
of something and
hence putting more effort into the communication.
All the coders agreed that
this is an implicit form of politeness.
Another example would be the fourth strategy of impoliteness
in our initial coding scheme which
marks the absence of politeness where it is expected
as a form of impoliteness.
All the coders agreed that
these are the implicit forms of politeness/impoliteness
and cannot be clearly derived from a single comment
without knowing the whole underlying context.
Other strategies, in contrast, involve ``explicit markers''
like verbal hedges, indirection, and the use of positive lexicons.
So, for the comments that were initially disputed,
we counted as `polite' only those
with an explicit marker of politeness
within the text itself.
Based on our discussion,
we fully developed a coding scheme
with less ambiguous steps for detecting politeness.
We also give directions on
how to rate the degree of politeness/impoliteness
and how to handle mixed comments.
Although we did not use this final modified scheme in our work,
we hope that it will prove useful in future work
and help maintain a clear set of standards
for rating the politeness of texts found in online conversations.
The scheme is publicly available at \href{https://tinyurl.com/y8g3b8zn}{https://tinyurl.com/y8g3b8zn}.
\subsection{Insights on Sentiment Coding}\label{sentschemedis}
The main confusion
in rating sentiment
was how to rate statements of technical detail
concerning \textit{bug reports} or \textit{bug fixes}
(e.g., ``fixed'', ``done''),
which are commonplace in software engineering and
do not necessarily express an emotion.
One coder initially rated all texts indicating bug fixes
as positive,
since these comments mark positive results.
The other coder, however, rated the same texts as neutral
unless they were explicitly emotive.
Similar confusion arose with bug reports and crash reports.
Upon discussion, the coders agreed that
such texts are samples of the normal day-to-day technical detail
of a software project,
and reached a consensus
that any text that does not explicitly show an emotion
would be rated as neutral.
%The initially disputed comments were resolved on such conclusions.
This confusion points
to the necessity of a customized annotation scheme
for software engineering.
\subsection{Tools' failure: Case analysis}
%To pinpoint the weaknesses in automated rating, we looked at the comments where the tools had a different rating from the humans. For the sentiment analysis tools, we perform a case study on Senti4SD, which had the best performance.
To investigate why the tools performed poorly,
we looked at the comments where
Senti4SD and the politeness tool
gave ratings different from the humans'.
For many lexicon entries,
Senti4SD may not have an accurate weight
and hence gives an inaccurately polar rating.
Some common words like ``error'', ``wrong'', or ``problem''
were rated as negative by the tool
while the human raters found them neutral.
There are also cases where the tool gave a neutral rating
but the humans perceived polar emotion:
for example, the tool may not give adequate weight
to words like ``thanks'', ``congrats'', ``LGTM''
and to emoticons, which the human raters read
as signs of positive sentiment.
The tool also could not catch
subtle criticism or frustration in the comments and
failed on very short texts like
``Yikes'' and ``Sorry''.
For texts containing mixed sentiments,
the tool could not determine
the commenter's eventual intent.
The other sentiment analysis tools did not
fail on exactly the same cases
and had varying precision and recall across the classes.
%The rationale behind this discrepancy is that different tools had different training data and therefore had different weights to different judging criteria. This also shows us the need of having a standard annotation scheme for software engineering domain, and train a tool based on that.
For the politeness tool,
many samples were only marginally misclassified,
judging by the scores the tool assigned them.
Like Senti4SD, the tool has problems
with short texts and
with proper lexicon weights.
Its failure on impolite comments stems from
impoliteness being much more subtle and context-dependent.
Even some explicit syntactic features,
like direct orders (e.g., ``Do not add'', ``see above'', ``replace''),
were rated as neutral by the tool.
However, the fact that these tools fail in distinct, identifiable ways
suggests that improvement is possible.
\section{Threats to Validity}
One major threat to our study is
that we had only two human coders
per comment.
Given the subjectivity of such analysis,
more coders would make the human evaluation more reliable.
Also, we had one coder rating all the comments
and two others each rating half of the data set.
This creates a possibility of human bias
in our manual ratings.
However, we also present the results over the comments
for which both the coders agree on the label
without having a prior discussion.
Besides this, Danescu-Niculescu-Mizil and colleagues
built their politeness tool using Brown and Levinson's theory
to extract politeness features
from manually rated data.
Our coding scheme is based on the same politeness theory,
which biases our annotation of the comments
towards the tool's internal workings.
%If we had a different or no coding scheme during the rating, the performance of the politeness with respect to the human rating could have come out worse.
Also,
we removed the URLs
before feeding the comments to the tool
but not from the comments provided to the coders,
which creates a difference between the two inputs.
%However,
%the human coders did not take the URLs into account
%as they are not commentary texts.
Finally, while we randomly picked 589 comments, they may not be representative of the whole GitHub community.
\section{Conclusion}
Studying the emotions expressed by developers
is fertile ground for research.
However, we find that
not only are the popular existing tools unreliable,
but humans too are inconsistent
in identifying sentiment and politeness
in developer discussions.
This demonstrates the need for
standardized coding schemes
for the human coders,
in order to build an oracle
and then perform customized training
of the tools on it,
so as to enable reliable affect analysis
in the software engineering domain.
%Through these conclusions, we demonstrate the need for a standard and more concrete definition of affects and their coding schemes to develop a reliable ground truth for affect analysis and train customized tools for the software engineering domain.
\section*{Acknowledgements}
Thanks to the Developer Liberation Front and anonymous
reviewers for their helpful suggestions.
This material is based in part upon work supported by the
National Science Foundation under grant number 1252995.
\begin{comment}