Alleviating Data Sparsity for Twitter Sentiment Analysis

Hassan Saif, Yulan He & Harith Alani, Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom. Making Sense of Microblogs, WWW2012 Conference, Lyon, France.

Upload: knowledge-media-institute-the-open-university

Post on 27-Jan-2015


DESCRIPTION

Twitter has attracted much attention recently as a hot research topic in the domain of sentiment analysis. Training sentiment classifiers from tweet data often faces the data sparsity problem, partly due to the large variety of short and irregular forms introduced to tweets because of the 140-character limit. In this work we propose using two different sets of features to alleviate the data sparsity problem. One is the semantic feature set, where we extract semantically hidden concepts from tweets and then incorporate them into classifier training through interpolation. The other is the sentiment-topic feature set, where we extract latent topics and the associated topic sentiment from tweets, then augment the original feature space with these sentiment-topics. Experimental results on the Stanford Twitter Sentiment Dataset show that both feature sets outperform the baseline model using unigrams only. Moreover, using semantic features rivals the previously reported best result. Using sentiment-topic features achieves 86.3% sentiment classification accuracy, which outperforms existing approaches.

TRANSCRIPT

Page 1: Alleviating Data Sparsity for Twitter Sentiment Analysis

Alleviating Data Sparsity For Twitter Sentiment Analysis

Hassan Saif, Yulan He & Harith Alani
Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom

Making Sense of Microblogs, WWW2012 Conference
Lyon, France

Page 2: Alleviating Data Sparsity for Twitter Sentiment Analysis

Outline

• Hello World
• Motivation
• Related Work
• Semantic Features
• Topic-Sentiment Features
• Evaluation
• Demos
• The Future

Page 3: Alleviating Data Sparsity for Twitter Sentiment Analysis

Sentiment Analysis

“Sentiment analysis is the task of identifying positive and negative opinions, emotions and evaluations in text”

• "The main dish was delicious" (Opinion)
• "It is a Syrian dish" (Fact)
• "The main dish was salty and horrible" (Opinion)

Page 4: Alleviating Data Sparsity for Twitter Sentiment Analysis

Microblogging


• Service which allows subscribers to post short updates online and broadcast them

• Answers the question: What are you doing now?

• Twitter, Plurk, sfeed, Yammer, BlueTwi, etc.

Page 5: Alleviating Data Sparsity for Twitter Sentiment Analysis

Together!

Page 6: Alleviating Data Sparsity for Twitter Sentiment Analysis

Sense?

Page 7: Alleviating Data Sparsity for Twitter Sentiment Analysis

Sense?

[Figure: bar chart, UK General Elections Corpus. Bars: Objective Tweets, Subjective Tweets; values shown: 1143562, 713178.]

AGARWAL et al.

BARBOSA, L., AND FENG, J.

BIFET, A., AND FRANK, E.

DIAKOPOULOS, N., AND SHAMMA, D.

GO et al.

He & Saif

PAK & PAROUBEK

And many others

Page 8: Alleviating Data Sparsity for Twitter Sentiment Analysis


Why

Because It is Critical

Private Sectors

Public Sectors

Keep In Touch

Page 9: Alleviating Data Sparsity for Twitter Sentiment Analysis

Related Work

Page 10: Alleviating Data Sparsity for Twitter Sentiment Analysis

Sentiment Analysis

Machine Learning Approach

– Text Classification Problem

– Right Features

Lexical-Based Approach

– Building Better Dictionary

– Word Polarity

Page 11: Alleviating Data Sparsity for Twitter Sentiment Analysis

Twitter Sentiment Analysis

Challenges

– The short length of status updates

– Language variations

– Open social environment

Page 12: Alleviating Data Sparsity for Twitter Sentiment Analysis

Twitter Sentiment Analysis

• Distant Supervision

– Supervised classifiers trained from noisy labels

– Tweet messages are labeled using emoticons

– Data filtering process

Go et al. (2009); Barbosa and Feng (2010); Pak and Paroubek (2010)

Related Work
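The emoticon-based labeling idea can be sketched roughly as follows. This is a minimal illustration, not the procedure of any of the cited papers: the emoticon lists and the filtering rule (drop tweets with no emoticon or with mixed emoticons, then strip the emoticons from the text) are assumptions made for the example.

```python
# Hypothetical emoticon lists; the slides do not enumerate them.
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(", "=(")


def emoticon_label(tweet):
    """Return 'positive', 'negative', or None, using emoticons as noisy labels."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        # No emoticon at all, or mixed signals: filter the tweet out.
        return None
    return "positive" if has_pos else "negative"


def strip_emoticons(tweet):
    """Remove the emoticons so the classifier cannot simply memorise them."""
    for e in POSITIVE + NEGATIVE:
        tweet = tweet.replace(e, "")
    return " ".join(tweet.split())


def build_training_set(tweets):
    """Keep only tweets with an unambiguous emoticon label."""
    data = []
    for t in tweets:
        label = emoticon_label(t)
        if label is not None:
            data.append((strip_emoticons(t), label))
    return data
```

Calling `build_training_set` on a list of raw tweets yields `(text, label)` pairs that a supervised classifier could then be trained on.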

Page 13: Alleviating Data Sparsity for Twitter Sentiment Analysis

Twitter Sentiment Analysis

• Followers Graph & Label Propagation

– Twitter follower graph (users, tweets, unigrams and hashtags)

– Start with a small number of labeled tweets

– Apply a label propagation method throughout the graph

Speriosu et al. (2011)

Related Work
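To make the label propagation idea concrete, here is a deliberately simplified sketch. It is a generic neighbour-averaging scheme, not the specific algorithm used in the cited work; the node names, the score range of -1 to 1, and the fixed iteration count are all assumptions for illustration.

```python
def propagate_labels(edges, seeds, iterations=20):
    """Spread sentiment scores from seed nodes over an undirected graph.

    edges: list of (u, v) pairs, e.g. tweet-to-unigram links.
    seeds: {node: score} with scores in [-1, 1]; seed scores stay fixed.
    """
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    # Unlabeled nodes start neutral.
    scores = {node: seeds.get(node, 0.0) for node in neighbours}
    for _ in range(iterations):
        updated = {}
        for node, adjacent in neighbours.items():
            if node in seeds:
                updated[node] = seeds[node]  # clamp the labeled nodes
            else:
                # Average the scores of the neighbouring nodes.
                updated[node] = sum(scores[a] for a in adjacent) / len(adjacent)
        scores = updated
    return scores
```

With two labeled tweets and two unlabeled tweets that share a unigram with them, the unlabeled tweets drift toward the sentiment of the labeled tweet they are connected to.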

Page 14: Alleviating Data Sparsity for Twitter Sentiment Analysis

Twitter Sentiment Analysis

• Feature Engineering

– Unigrams, bigrams, POS

– Microblogging features

• Hashtags

• Emoticons

• Abbreviations & Intensifiers

Agarwal et al. (2011); Kouloumpis et al. (2011)

Related Work
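A rough sketch of what such a feature extractor might look like is given below. The feature names, the emoticon lists, and the use of elongated words (such as "sooo") as a stand-in for intensifiers are assumptions for illustration; POS tags and abbreviation lookups are left out because they need external resources.

```python
import re

# Hypothetical emoticon lists, chosen only for this example.
POSITIVE_EMOTICONS = (":)", ":-)", ":D")
NEGATIVE_EMOTICONS = (":(", ":-(")


def extract_features(tweet):
    """Return a sparse feature dict with n-gram and microblogging features."""
    text = tweet.lower()
    tokens = text.split()
    features = {}

    # Unigram counts.
    for tok in tokens:
        key = "uni=" + tok
        features[key] = features.get(key, 0) + 1

    # Bigram counts over adjacent tokens.
    for first, second in zip(tokens, tokens[1:]):
        key = "bi=" + first + "_" + second
        features[key] = features.get(key, 0) + 1

    # Microblogging features: hashtags, emoticons, elongated words.
    for tag in re.findall(r"#\w+", text):
        features["hashtag=" + tag] = 1
    if any(e in tweet for e in POSITIVE_EMOTICONS):
        features["has_pos_emoticon"] = 1
    if any(e in tweet for e in NEGATIVE_EMOTICONS):
        features["has_neg_emoticon"] = 1
    if re.search(r"(\w)\1{2,}", text):
        features["has_elongation"] = 1
    return features
```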

Page 15: Alleviating Data Sparsity for Twitter Sentiment Analysis

So?

Page 16: Alleviating Data Sparsity for Twitter Sentiment Analysis

What Does Sparsity Mean?

Training data contains many infrequent terms

Page 17: Alleviating Data Sparsity for Twitter Sentiment Analysis

What Does Sparsity Mean?

Word frequency statistics
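One simple way to quantify this kind of sparsity is the share of vocabulary terms that occur only rarely in the training data. The sketch below assumes whitespace tokenization and a threshold of one occurrence; both are illustrative choices, not the statistics reported on the slide.

```python
from collections import Counter


def rare_term_ratio(tweets, max_count=1):
    """Fraction of distinct terms that occur at most max_count times."""
    counts = Counter(tok for tweet in tweets for tok in tweet.lower().split())
    if not counts:
        return 0.0
    rare = sum(1 for c in counts.values() if c <= max_count)
    return rare / len(counts)
```

For example, in the three tweets "good day", "good nite" and "gud day", two of the four distinct terms ("nite" and "gud") occur once, so the ratio is 0.5. Irregular spellings like these are what inflate the number of infrequent terms.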

Page 18: Alleviating Data Sparsity for Twitter Sentiment Analysis

How!

Semantic Features

Extract semantically hidden concepts from tweet data and then incorporate them into supervised classifier training by interpolation.

Sentiment-Topic Features

Extract latent topics and the associated topic sentiment from the tweet data, which are subsequently added to the original feature space for supervised classifier training.

Page 19: Alleviating Data Sparsity for Twitter Sentiment Analysis

Semantic Features

Shallow Semantic Method

• "Sushi time for fabulous Jesse's last day on dragons den" (Person)

• "@Stace_meister Ya, I have Rugby in an hour" (Sport)

• "Dear eBay, if I win I owe you a total 580.63 bye paycheck" (Company)
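The two shallow ways of using extracted concepts that appear later in the evaluation, replacement and augmentation, can be sketched as below. The `ENTITY_CONCEPTS` lookup table is hypothetical and hard-coded for the example tweets; in practice the concepts would come from an entity extraction step that the slides do not detail.

```python
import re

# Hypothetical entity-to-concept lookup for the example tweets only.
ENTITY_CONCEPTS = {"jesse": "PERSON", "rugby": "SPORT", "ebay": "COMPANY"}


def tokenize(tweet):
    """Lowercase the tweet and split it into word tokens."""
    return re.findall(r"\w+", tweet.lower())


def semantic_replacement(tokens):
    """Replace each recognised entity with its semantic concept."""
    return [ENTITY_CONCEPTS.get(tok, tok) for tok in tokens]


def semantic_augmentation(tokens):
    """Keep all original tokens and append the concepts as extra features."""
    concepts = [ENTITY_CONCEPTS[tok] for tok in tokens if tok in ENTITY_CONCEPTS]
    return tokens + concepts
```

Replacement shrinks the vocabulary but discards the original word, while augmentation keeps the word and adds the concept alongside it.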

Page 20: Alleviating Data Sparsity for Twitter Sentiment Analysis

Semantic Features

Interpolation Method
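The slide does not reproduce the formula, so the following is only a sketch under the assumption of a simple linear interpolation: the word likelihood used by the classifier mixes the plain unigram estimate P_u(w|c) with a concept-based estimate, the sum over concepts s of P(w|s) P(s|c), weighted by a coefficient alpha. The function name, argument names, and the default value of alpha are illustrative.

```python
def interpolated_prob(word, cls, p_word_given_class, p_word_given_concept,
                      p_concept_given_class, alpha=0.5):
    """Assumed form: (1 - alpha) * P_u(w|c) + alpha * sum_s P(w|s) * P(s|c)."""
    unigram = p_word_given_class[cls].get(word, 0.0)
    semantic = sum(
        p_word_given_concept[concept].get(word, 0.0) * p_concept
        for concept, p_concept in p_concept_given_class[cls].items()
    )
    return (1 - alpha) * unigram + alpha * semantic
```

The point of the mixture is that a word never seen with a class in training, such as a rare product name, can still receive non-zero probability through the concept it belongs to.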

Page 21: Alleviating Data Sparsity for Twitter Sentiment Analysis

Topic-Sentiment Features

Joint Sentiment Topic Model

JST is a four-layer generative model which allows the detection of both sentiment and topic simultaneously from text.

The only supervision is word prior polarity information, which can be obtained from the MPQA subjectivity lexicon.

Lin & He (2009)
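The layered structure (document, sentiment, topic, word) can be illustrated by sampling from a generative process of this shape. This is a toy sketch, not the authors' implementation: the word distributions `phi` are passed in as fixed inputs rather than learned or drawn from a lexicon-informed prior, and the parameter names `gamma` and `alpha` are assumptions.

```python
import random


def dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet via normalised gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, phi, gamma=1.0, alpha=1.0, seed=0):
    """Sample (word, sentiment, topic) triples for one document.

    phi[l][z] is a dict mapping words to probabilities for
    sentiment label l and topic z.
    """
    rng = random.Random(seed)
    n_sentiments, n_topics = len(phi), len(phi[0])

    # Per-document sentiment distribution.
    pi = dirichlet(gamma, n_sentiments, rng)
    # Per-document topic distribution for each sentiment label.
    theta = [dirichlet(alpha, n_topics, rng) for _ in range(n_sentiments)]

    document = []
    for _ in range(n_words):
        l = rng.choices(range(n_sentiments), weights=pi)[0]
        z = rng.choices(range(n_topics), weights=theta[l])[0]
        words = list(phi[l][z])
        weights = [phi[l][z][w] for w in words]
        document.append((rng.choices(words, weights=weights)[0], l, z))
    return document
```

Each word is tied to both a sentiment label and a topic, which is what allows sentiment-topic pairs to be used as additional features.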

Page 22: Alleviating Data Sparsity for Twitter Sentiment Analysis

Twitter Sentiment Corpus

Collected: between the 6th of April and the 25th of June 2009

Training Set: 1.6 million tweets (Balanced)

Testing Set: 177 negative tweets & 182 positive tweets

Stanford University

http://twittersentiment.appspot.com/

Page 23: Alleviating Data Sparsity for Twitter Sentiment Analysis

Our Sentiment Corpus

• Training Set: 60K tweets

• Testing Set: 1000 tweets

• Annotated additional 640 tweets using Tweenator

Page 24: Alleviating Data Sparsity for Twitter Sentiment Analysis

Evaluation

Extended Test Set

Method                      Accuracy
Unigrams                    80.7%
Semantic replacement        76.3%
Semantic interpolation      84.0%
Sentiment-topic features    82.3%

Sentiment classification results on the 1000-tweet test set

Page 25: Alleviating Data Sparsity for Twitter Sentiment Analysis

Evaluation

Stanford Dataset

Method                      Accuracy
Unigrams                    81.0%
Semantic replacement        77.3%
Semantic augmentation       80.45%
Semantic interpolation      84.1%
Sentiment-topic features    86.3%
(Go et al., 2009)           83%
(Speriosu et al., 2011)     84.7%

Sentiment classification results on the original Stanford Twitter Sentiment test set

Page 26: Alleviating Data Sparsity for Twitter Sentiment Analysis

Evaluation

Semantic Features vs. Sentiment-Topic Features

Page 27: Alleviating Data Sparsity for Twitter Sentiment Analysis

Tweenator

http://tweenator.com

Page 28: Alleviating Data Sparsity for Twitter Sentiment Analysis

Conclusion

• Twitter sentiment analysis faces a data sparsity problem due to some special characteristics of Twitter

• Semantic & topic-sentiment features reduce the sparsity problem and increase the performance significantly

• Sentiment-topic features should be preferred over semantic features for the sentiment classification task since they give much better results with far fewer features.

Page 29: Alleviating Data Sparsity for Twitter Sentiment Analysis

The Future

Page 30: Alleviating Data Sparsity for Twitter Sentiment Analysis

Future Work

Semantic Smoothing Model

Statistical replacement

Enhance Entity Extraction Methods

Attaching weight to extracted features

Sentiment-Topic Model

Page 31: Alleviating Data Sparsity for Twitter Sentiment Analysis

References

[1] AGARWAL, A., XIE, B., VOVSHA, I., RAMBOW, O., AND PASSONNEAU, R. Sentiment analysis of Twitter data. In Proceedings of the ACL 2011 Workshop on Languages in Social Media (2011), pp. 30–38.

[2] BARBOSA, L., AND FENG, J. Robust sentiment detection on Twitter from biased and noisy data. In Proceedings of COLING (2010), pp. 36–44.

[3] BIFET, A., AND FRANK, E. Sentiment knowledge discovery in Twitter streaming data. In Discovery Science (2010), Springer, pp. 1–15.

[4] GO, A., BHAYANI, R., AND HUANG, L. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford (2009).

[5] KOULOUMPIS, E., WILSON, T., AND MOORE, J. Twitter sentiment analysis: The good the bad and the omg! In Proceedings of the ICWSM (2011).

[6] LIN, C., AND HE, Y. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (2009), ACM, pp. 375–384.

[7] PAK, A., AND PAROUBEK, P. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of LREC 2010 (2010).

[8] SAIF, H., HE, Y., AND ALANI, H. Semantic Smoothing for Twitter Sentiment Analysis. In Proceedings of the 10th International Semantic Web Conference (ISWC) (2011).

Page 32: Alleviating Data Sparsity for Twitter Sentiment Analysis

Thank You