Sentiment analysis tools for software engineering research cannot be used out of the box
On sentiment analysis tools for software engineering research
Robbert Jongeling, Eindhoven U of Technology (NL), @jongeling_r
Subhajit Datta, Singapore U of Technology and Design (SG), @datta_subhajit
Alexander Serebrenik, Eindhoven U of Technology (NL), @aserebrenik
E. Guzman, D. Azócar, and Y. Li, “Sentiment analysis of commit comments in GitHub: An empirical study,” MSR 2014
A.-I. Rousinopoulos, G. Robles, and J. M. González-Barahona, “Sentiment analysis of Free/Open Source developers: preliminary findings from a case study,” Revista Eletrônica de Sistemas de Informação, 2014
E. Guzman and B. Bruegge, “Towards emotional awareness in software development teams,” in Joint Meeting on Foundations of Software Engineering, 2013
D. Pletea, B. Vasilescu, and A. Serebrenik, “Security and emotion: Sentiment analysis of security discussions on GitHub,” MSR 2014
M. Ortu, B. Adams, G. Destefanis, P. Tourani, M. Marchesi, and R. Tonelli, “Are bullies more productive? Empirical study of affectiveness vs. issue fixing time,” in MSR 2015
D. Garcia, M. S. Zanetti, and F. Schweitzer, “The role of emotions in contributors activity: A case study on the Gentoo community,” in International Conference on Cloud and Green Computing, 2013
Tools used in these studies: NLTK, SentiStrength
Sentiment analysis tools are trained on movie/product reviews. Threat: they might misidentify (or fail to identify) a sentiment in a software engineering artefact
• RQ1: To what extent do different sentiment analysis tools agree with emotions of software developers?
• RQ2: To what extent do different sentiment analysis tools agree with each other?
• RQ3: Do different sentiment analysis tools lead to contradictory results in a software engineering study?
Murgia et al. MSR 2014
392 comments × 4 evaluators
Emotions: joy, love, surprise, anger, fear, sadness (grouped as positive / negative)
Consistent labelling:
• positive: 3 positive, none negative
• negative: 3 negative, none positive
• neutral: ≥3 without emotion indication
Resulting labels: 54 positive, 24 negative, 217 neutral
Tools: Alchemy, Stanford NLP, NLTK, SentiStrength
RQ1: manual labels (neg, neu, pos) vs. labels of each tool (neg, neu, pos)
RQ2: labels of Tool A (neg, neu, pos) vs. labels of Tool B (neg, neu, pos)
Agreement measure: 0 ≤ Adjusted Rand Index ≤ 1 [Santos, Embrechts, ICANN 2009]
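The consistent-labelling rule above can be sketched as a small function. This is only an illustration: the vote encoding ('pos', 'neg', 'none') and the function name are hypothetical, and "3 positive" is read here as "at least 3 of the 4 evaluators", which is an assumption.

```python
from collections import Counter


def consistent_label(votes):
    """Return 'positive', 'negative', 'neutral', or None if evaluators are inconsistent.

    votes: one entry per evaluator, each 'pos', 'neg' or 'none'
    ('none' = no emotion indicated). Hypothetical encoding.
    Assumption: "3 positive" on the slide means "at least 3".
    """
    counts = Counter(votes)
    if counts["pos"] >= 3 and counts["neg"] == 0:
        return "positive"
    if counts["neg"] >= 3 and counts["pos"] == 0:
        return "negative"
    if counts["none"] >= 3:
        return "neutral"
    return None  # no consistent label under this reading of the rule


print(consistent_label(["pos", "pos", "pos", "none"]))
print(consistent_label(["pos", "pos", "neg", "none"]))
```

Comments for which the function returns None have no consistent label; presumably only consistently labelled comments enter the comparison with the tools.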
RQ1: To what extent do different sentiment analysis tools agree with emotions of software developers?
NLTK (rows) vs. manual labelling (columns):
        neg   neu   pos
neg      19    51    11
neu       0   138     7
pos       5    28    36

Agreement with manual labelling:
Tool            ARI
NLTK            0.239
SentiStrength   0.113
Stanford NLP    0.108
Alchemy         0.079
Tools do not agree with manual evaluation
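As a worked check, the ARI for NLTK can be recomputed from the NLTK vs. manual table. The sketch below uses the standard pair-counting (Hubert-Arabie) form of the Adjusted Rand Index; that the authors used exactly this formulation is an assumption, although it does reproduce the reported 0.239. In this form, values slightly below 0 are possible when agreement is worse than chance.

```python
from math import comb


def adjusted_rand_index(table):
    """Pair-counting Adjusted Rand Index of a contingency table (list of rows)."""
    n = sum(sum(row) for row in table)
    # Pairs of comments that both labellings put in the same cell.
    cells = sum(comb(v, 2) for row in table for v in row)
    # Pairs grouped together by the row labelling / by the column labelling.
    rows = sum(comb(sum(row), 2) for row in table)
    cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = rows * cols / comb(n, 2)  # agreement expected by chance
    maximum = (rows + cols) / 2
    return (cells - expected) / (maximum - expected)


# Rows: NLTK (neg, neu, pos); columns: manual labelling (neg, neu, pos).
nltk_vs_manual = [
    [19, 51, 11],
    [0, 138, 7],
    [5, 28, 36],
]

print(round(adjusted_rand_index(nltk_vs_manual), 3))  # 0.239
```

A value of 1 would mean the tool and the evaluators partition the 295 comments identically; 0.239 is far from that.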
RQ2: To what extent do different sentiment analysis tools agree with each other?
NLTK (rows) vs. SentiStrength (columns):
        neg   neu   pos
neg      17    39    25
neu      15    96    34
pos       6    20    43

Agreement between tools:
Tool A   Tool B          ARI
NLTK     Alchemy         0.104
NLTK     SentiStrength   0.090
Tools do not agree with each other
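A simpler, complementary view of the NLTK vs. SentiStrength table is raw agreement: how many comments get the same label from both tools, and how many get opposite polarity. These counts are derived here from the table and are not figures reported on the slides; unlike ARI they are not corrected for chance.

```python
# Rows: NLTK (neg, neu, pos); columns: SentiStrength (neg, neu, pos).
nltk_vs_sentistrength = [
    [17, 39, 25],
    [15, 96, 34],
    [6, 20, 43],
]

total = sum(sum(row) for row in nltk_vs_sentistrength)
# Diagonal cells: both tools assign the same label.
same_label = sum(nltk_vs_sentistrength[i][i] for i in range(3))
# Corner cells: one tool says negative, the other positive.
opposite = nltk_vs_sentistrength[0][2] + nltk_vs_sentistrength[2][0]

print(same_label, total, round(same_label / total, 3))  # 156 295 0.529
print(opposite)  # 31
```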
RQ3
Study design: take issues/questions from an issue tracker or a Q&A site, extract text and response time, classify the text with a sentiment analysis tool, compare times for neg, neu, pos
Repeated with three classifiers: NLTK, SentiStrength, NLTK ∩ SentiStrength
Are the results the same?
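The RQ3 setup can be sketched roughly as follows. Everything here is illustrative: the two keyword classifiers are toy stand-ins, not NLTK or SentiStrength, the issue data is made up, and medians stand in for the statistical comparison, which the slides do not specify. The point is only to show how the same data can be grouped by one tool, by another, or by their intersection (items on which both tools agree).

```python
from statistics import median


def median_time_by_sentiment(items, classify):
    """Group response times by the label classify(text) returns; take medians."""
    groups = {}
    for text, hours in items:
        label = classify(text)
        if label is not None:  # None = item excluded by the classifier
            groups.setdefault(label, []).append(hours)
    return {label: median(times) for label, times in groups.items()}


def intersection(tool_a, tool_b):
    """Classifier that keeps a label only when both tools agree on it."""
    def classify(text):
        a, b = tool_a(text), tool_b(text)
        return a if a == b else None
    return classify


# Hypothetical stand-ins for real sentiment analysis tools.
def toy_tool_a(text):
    lowered = text.lower()
    if "thanks" in lowered:
        return "pos"
    if "crash" in lowered:
        return "neg"
    return "neu"


def toy_tool_b(text):
    lowered = text.lower()
    if "thanks" in lowered or "please" in lowered:
        return "pos"
    if "crash" in lowered or "fail" in lowered:
        return "neg"
    return "neu"


# Made-up (text, response time in hours) pairs.
issues = [
    ("App crashes on start", 40.0),
    ("Build fails on Windows", 10.0),
    ("Thanks, works great", 5.0),
    ("Please add dark mode", 30.0),
    ("Update documentation", 20.0),
]

for name, tool in [("A", toy_tool_a), ("B", toy_tool_b),
                   ("A ∩ B", intersection(toy_tool_a, toy_tool_b))]:
    print(name, median_time_by_sentiment(issues, tool))
```

Even on this toy data the per-class medians differ between the two classifiers, which is the kind of discrepancy RQ3 asks about.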
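Relations reported per data set (columns: NLTK, SentiStrength, NLTK ∩ SentiStrength):
• ASF descr: neg > neu*** (two of the three columns), pos > neu*** (all three), pos > neg*** (two of the three)
• ASF title: neg > neu** (one column), pos > neu*** and pos > neu** (two columns), pos > neg* and pos > neg** (two columns)
• GNOME descr: neg > neu*** (all three), pos > neu*** (all three), pos > neg*** (one column), neg > pos*** (a different column)
• SO descr: NLTK ø, SentiStrength neg > pos*, NLTK ∩ SentiStrength ø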
RQ3: Do different sentiment analysis tools lead to contradictory results in a software engineering study?
Choice of the sentiment analysis tool affects results of the software engineering study
Summary
• Sentiment analysis tools are trained on movie/product reviews. Threat: might misidentify (or fail to identify) a sentiment in a software engineering artefact
• Tools do not agree with manual evaluation
• Tools do not agree with each other
• Choice of the sentiment analysis tool affects results of the software engineering study
Next steps?
• Train sentiment analysis tools on software engineering data
• Data of Murgia et al.: first step
• More and better-suited data is needed