Sentiment analysis tools for software engineering research cannot be used out of the box

On sentiment analysis tools for software engineering research

Robbert Jongeling, Eindhoven U of Technology (NL), @jongeling_r

Subhajit Datta, Singapore U of Technology and Design (SG), @datta_subhajit

Alexander Serebrenik, Eindhoven U of Technology (NL), @aserebrenik

E. Guzman, D. Azócar, and Y. Li, “Sentiment analysis of commit comments in GitHub: An empirical study,” MSR 2014

A.-I. Rousinopoulos, G. Robles, and J. M. González-Barahona, “Sentiment analysis of Free/Open Source developers: preliminary findings from a case study,” Revista Eletrônica de Sistemas de Informação, 2014

E. Guzman and B. Bruegge, “Towards emotional awareness in software development teams,” in Joint Meeting on Foundations of Software Engineering, 2013

D. Pletea, B. Vasilescu, and A. Serebrenik, “Security and emotion: Sentiment analysis of security discussions on GitHub,” MSR 2014

M. Ortu, B. Adams, G. Destefanis, P. Tourani, M. Marchesi, and R. Tonelli, “Are bullies more productive? Empirical study of affectiveness vs. issue fixing time,” in MSR 2015

D. Garcia, M. S. Zanetti, and F. Schweitzer, “The role of emotions in contributors activity: A case study on the Gentoo community,” in International Conference on Cloud and Green Computing, 2013









[Figure: 2014 and 2015 studies and the sentiment analysis tools they use: NLTK, SentiStrength]

Trained on movie/product reviews. Threat: might misidentify (or fail to identify) a sentiment in a software engineering artefact

• RQ1: To what extent do different sentiment analysis tools agree with emotions of software developers?

• RQ2: To what extent do different sentiment analysis tools agree with each other?

• RQ3: Do different sentiment analysis tools lead to contradictory results in a software engineering study?

Murgia et al. MSR 2014

392 comments x 4 evaluators

Emotions: joy, love, surprise, anger, fear, sadness (grouped into positive and negative)

Consistent:

• positive: 3 positive, none negative

• negative: 3 negative, none positive

• neutral: ≥3 without emotion indication
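The consistency rule above can be sketched as a small function. This is a minimal illustration, not the authors' code: the label names `pos`, `neg`, `none` and the function name `consolidate` are hypothetical, and "3 positive" is read here as "at least 3 of the 4 evaluators", which is an assumption.

```python
from collections import Counter

def consolidate(labels):
    """Map one comment's evaluator labels to 'pos', 'neg', 'neu', or None.

    labels: one entry per evaluator, each 'pos', 'neg', or 'none'
    (hypothetical encoding). Assumes "3" means "at least 3".
    """
    counts = Counter(labels)
    if counts["pos"] >= 3 and counts["neg"] == 0:
        return "pos"
    if counts["neg"] >= 3 and counts["pos"] == 0:
        return "neg"
    if counts["none"] >= 3:
        return "neu"
    return None  # no consistent label; the comment is left out

print(consolidate(["pos", "pos", "pos", "none"]))
print(consolidate(["pos", "pos", "neg", "none"]))
```

Comments for which the function returns `None` have no consistent label and would not enter the comparison with the tools.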

Tools: Alchemy, Stanford NLP, NLTK, SentiStrength

RQ1: contingency table of Manual (neg, neu, pos) against Tool (neg, neu, pos)

RQ2: contingency table of Tool A (neg, neu, pos) against Tool B (neg, neu, pos)
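A contingency table of this shape can be built directly from two label sequences. The sketch below is illustrative only: the function name `contingency`, the label strings, and the five-item sample data are all made up for the example.

```python
from collections import Counter

LABELS = ("neg", "neu", "pos")

def contingency(row_labels, col_labels):
    """Return a 3x3 table: cell [i][j] counts the items labelled
    LABELS[i] by the row source and LABELS[j] by the column source."""
    if len(row_labels) != len(col_labels):
        raise ValueError("both sources must label the same items")
    pairs = Counter(zip(row_labels, col_labels))
    return [[pairs[(r, c)] for c in LABELS] for r in LABELS]

# Made-up labels for five comments (not data from the study).
tool = ["neg", "neu", "pos", "neu", "pos"]
manual = ["neg", "neu", "neu", "neu", "pos"]

table = contingency(tool, manual)
for label, row in zip(LABELS, table):
    print(label, row)
```

The same function serves both questions: pass tool and manual labels for RQ1, or the labels of two tools for RQ2.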

Consistently labelled comments: 54 positive, 24 negative, 217 neutral

Agreement measure: 0 ≤ Adjusted Rand Index ≤ 1 [Santos, Embrechts, ICANN 2009]

RQ1: To what extent do different sentiment analysis tools agree with emotions of software developers?

RQ1: NLTK (rows) vs. Manual (columns)

         neg   neu   pos
neg       19    51    11
neu        0   138     7
pos        5    28    36

Tool            ARI
NLTK            0.239
SentiStrength   0.113
Stanford NLP    0.108
Alchemy         0.079
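The NLTK value can be recomputed from the NLTK vs. Manual contingency table above. The sketch assumes the standard Hubert and Arabie formulation of the Adjusted Rand Index; the slides do not spell out the formula, so treat that choice as an assumption. In this general formulation the index equals 1 for identical labellings, is close to 0 for chance-level agreement, and can drop slightly below 0 when agreement is worse than chance.

```python
from math import comb

def adjusted_rand_index(table):
    """Adjusted Rand Index from a contingency table (list of rows).

    Standard Hubert-Arabie form; assumed, not confirmed by the slides.
    Undefined (division by zero) when all items share a single label
    in both labellings.
    """
    n = sum(sum(row) for row in table)
    sum_cells = sum(comb(v, 2) for row in table for v in row)
    sum_rows = sum(comb(sum(row), 2) for row in table)
    sum_cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (maximum - expected)

# Rows: NLTK neg/neu/pos; columns: Manual neg/neu/pos (from the slide).
nltk_vs_manual = [
    [19, 51, 11],
    [0, 138, 7],
    [5, 28, 36],
]

print(round(adjusted_rand_index(nltk_vs_manual), 3))  # 0.239
```

The result matches the 0.239 reported for NLTK, which suggests the table and the index are being read the same way as on the slide.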

Tools do not agree with manual evaluation


RQ2: To what extent do different sentiment analysis tools agree with each other?

RQ2: NLTK (rows) vs. SentiStrength (columns)

         neg   neu   pos
neg       17    39    25
neu       15    96    34
pos        6    20    43

Tool A   Tool B          ARI
NLTK     Alchemy         0.104
NLTK     SentiStrength   0.090

Tools do not agree with each other

RQ3

Setup:

• Data sources: issue tracker (issues), Q&A site (questions)

• For each issue/question: text and response time

• Sentiment Analysis Tool applied to the text: NLTK, SentiStrength, or NLTK ∩ SentiStrength

• Compare times for neg, neu, pos
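The RQ3 setup can be sketched as follows. Several things here are assumptions rather than details from the slides: NLTK ∩ SentiStrength is read as "keep only texts on which both tools give the same label", the comparison uses group medians as a stand-in (the slides do not name the statistical procedure), and all names and numbers are made up for illustration.

```python
from statistics import median

def agree(labels_a, labels_b):
    """Keep a label only where both tools agree; otherwise None.
    Assumed reading of the "∩" configuration."""
    return [a if a == b else None for a, b in zip(labels_a, labels_b)]

def median_time_by_sentiment(labels, response_times):
    """Group response times by sentiment label, skipping None labels,
    and return the median per group (a placeholder comparison)."""
    groups = {}
    for label, t in zip(labels, response_times):
        if label is not None:
            groups.setdefault(label, []).append(t)
    return {label: median(ts) for label, ts in groups.items()}

# Made-up labels and response times in hours (not data from the study).
nltk_labels = ["neg", "neu", "pos", "pos", "neu", "neg"]
senti_labels = ["neg", "neu", "neu", "pos", "neu", "pos"]
hours = [30.0, 5.0, 12.0, 20.0, 7.0, 40.0]

configurations = [
    ("NLTK", nltk_labels),
    ("SentiStrength", senti_labels),
    ("NLTK ∩ SentiStrength", agree(nltk_labels, senti_labels)),
]
for name, labels in configurations:
    print(name, median_time_by_sentiment(labels, hours))
```

Even on this toy data the ordering of the groups changes with the tool, which is the kind of effect RQ3 asks about.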

Are the results the same?

Tool configurations: NLTK, SentiStrength, NLTK ∩ SentiStrength

• ASF descr: pos > neu*** (all three configurations); neg > neu*** (two); pos > neg*** (two)

• ASF title: pos > neu*** and pos > neu** (one configuration each); neg > neu** (one); pos > neg* and pos > neg** (one each)

• GNOME descr: neg > neu*** (all three); pos > neu*** (all three); pos > neg*** (one); neg > pos*** (one)

• SO descr: NLTK: ø; SentiStrength: neg > pos*; NLTK ∩ SentiStrength: ø

RQ3: Do different sentiment analysis tools lead to contradictory results in a software engineering study?

Choice of the sentiment analysis tool affects results of the software engineering study

Summary

• Sentiment analysis tools are trained on movie/product reviews. Threat: might misidentify (or fail to identify) a sentiment in a software engineering artefact

• Tools do not agree with manual evaluation

• Tools do not agree with each other

• Choice of the sentiment analysis tool affects results of the software engineering study

Next steps?

• Train sentiment analysis tools on software engineering data

• Data of Murgia et al.: first step

• More and better-suited data is needed