Sentiment Knowledge Discovery in Twitter Streaming Data

Sentiment Knowledge Discovery in Twitter Streaming Data Albert Bifet and Eibe Frank University of Waikato Hamilton, New Zealand Canberra, 7 October 2010 Discovery Science 2010

Upload: albert-bifet

Post on 08-May-2015


DESCRIPTION

This talk shows how to use the new Twitter Streaming API to obtain new knowledge using data mining methods for evolving data streams.

TRANSCRIPT

Page 1: Sentiment Knowledge Discovery in Twitter Streaming Data

Sentiment Knowledge Discovery in Twitter Streaming Data

Albert Bifet and Eibe Frank

University of Waikato
Hamilton, New Zealand

Canberra, 7 October 2010
Discovery Science 2010

Page 2: Sentiment Knowledge Discovery in Twitter Streaming Data

Twitter: A Massive Data Stream

Web 2.0

Micro-blogging service
Built to discover what is happening at any moment in time, anywhere in the world.
106 million registered users
600 million search queries per day
3 billion requests a day via its API

2 / 26

Page 3: Sentiment Knowledge Discovery in Twitter Streaming Data

Outline

1 Twitter Streaming Data

2 Twitter Sentiment Classification: Metrics and Methods

3 Empirical results

3 / 26


Page 5: Sentiment Knowledge Discovery in Twitter Streaming Data

Data stream classification cycle

1 Process an example at a time, and inspect it only once (at most)

2 Use a limited amount of memory

3 Work in a limited amount of time

4 Be ready to predict at any point

5 / 26

Page 6: Sentiment Knowledge Discovery in Twitter Streaming Data

Data stream classification cycle

Evaluation procedures for Data Streams
  Holdout
  Interleaved Test-Then-Train ("Prequential" Evaluation)
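A minimal sketch of interleaved test-then-train evaluation: each example is first used to test the current model and only then to train it. The `MajorityClassLearner` class and the toy `stream` are hypothetical stand-ins, not part of the talk; any learner with `predict` and `learn` methods would fit.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner (hypothetical): predicts the most frequent label so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any training example has arrived there is nothing to predict.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test on each example first, then train on it; return the overall accuracy."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:  # test ...
            correct += 1
        learner.learn(x, y)          # ... then train
        total += 1
    return correct / total if total else 0.0


# Toy stream of (instance, label) pairs, invented for illustration.
stream = [("a", "+"), ("b", "+"), ("c", "-"), ("d", "+")]
accuracy = prequential_accuracy(stream, MajorityClassLearner())
```

Because every example is tested before it is learned, no separate holdout set is needed and the model is inspected at every point of the stream.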

5 / 26

Page 7: Sentiment Knowledge Discovery in Twitter Streaming Data

Twitter Streaming API

Twitter APIs
  Streaming API
  Two discrete REST APIs

Real-time access to Tweets
  sampled form
  filtered form

HTTP based
  GET
  POST
  DELETE

6 / 26

Page 8: Sentiment Knowledge Discovery in Twitter Streaming Data

Sentiment Analysis on Twitter

Sentiment analysis
  Classifying messages into two categories depending on whether they convey positive or negative feelings

Emoticons are visual cues associated with emotional states, which can be used to define class labels for sentiment classification

Positive Emoticons    Negative Emoticons
:)                    :(
:-)                   :-(
: )                   : (
:D
=)

Table: List of positive and negative emoticons.
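The emoticon table can be turned into a labelling rule along these lines. This is only a sketch: the function names are hypothetical, and two policies are assumptions not stated on the slide, namely that tweets with both kinds of emoticon are left unlabelled and that emoticons are stripped from the text before training.

```python
POSITIVE = (":)", ":-)", ": )", ":D", "=)")
NEGATIVE = (":(", ":-(", ": (")


def emoticon_label(text):
    """Return "+", "-", or None when the tweet has no emoticon or mixed emoticons."""
    has_pos = any(e in text for e in POSITIVE)
    has_neg = any(e in text for e in NEGATIVE)
    if has_pos == has_neg:  # neither or both: treated as ambiguous (assumed policy)
        return None
    return "+" if has_pos else "-"


def strip_emoticons(text):
    """Remove the listed emoticons so a learner cannot read the label off the text."""
    for emoticon in POSITIVE + NEGATIVE:
        text = text.replace(emoticon, " ")
    return " ".join(text.split())  # collapse the whitespace left behind
```

For example, `emoticon_label("great day :)")` yields `"+"`, while a tweet without any listed emoticon yields `None` and would be skipped.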

7 / 26

Page 9: Sentiment Knowledge Discovery in Twitter Streaming Data

Outline

1 Twitter Streaming Data

2 Twitter Sentiment Classification: Metrics and Methods

3 Empirical results

8 / 26

Page 10: Sentiment Knowledge Discovery in Twitter Streaming Data

Streaming Data Evaluation with Unbalanced Classes

                 Predicted Class+   Predicted Class-   Total
Correct Class+   75                 8                  83
Correct Class-   7                  10                 17
Total            82                 18                 100

Table: Simple confusion matrix example

                 Predicted Class+   Predicted Class-   Total
Correct Class+   68.06              14.94              83
Correct Class-   13.94              3.06               17
Total            82                 18                 100

Table: Confusion matrix for chance predictor
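The chance-predictor matrix follows from the row and column totals of the first matrix: each cell is (row total × column total) / grand total. A small sketch, with a hypothetical helper name, that reproduces the numbers above:

```python
def chance_matrix(matrix):
    """Expected counts for a chance predictor with the same row and column totals."""
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    return [[r * c / total for c in col_totals] for r in row_totals]


# Rows: correct class (+, -); columns: predicted class (+, -).
observed = [[75, 8], [7, 10]]
expected = chance_matrix(observed)

n = sum(sum(row) for row in observed)
accuracy = (observed[0][0] + observed[1][1]) / n         # 0.85
chance_accuracy = (expected[0][0] + expected[1][1]) / n  # about 0.7112
```

The classifier is right 85% of the time, but a chance predictor with the same class proportions would already be right about 71% of the time, which is why plain accuracy is misleading on unbalanced streams.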

9 / 26

Page 11: Sentiment Knowledge Discovery in Twitter Streaming Data

Streaming Data Evaluation with Unbalanced Classes

Kappa Statistic
  $p_0$: classifier’s prequential accuracy
  $p_c$: probability that a chance classifier makes a correct prediction
  κ statistic:

$$\kappa = \frac{p_0 - p_c}{1 - p_c}$$

  κ = 1 if the classifier is always correct
  κ = 0 if the predictions coincide with the correct ones as often as those of the chance classifier

Forgetting mechanism for estimating prequential kappa
  Sliding window of size w with the most recent observations
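A sliding-window estimate of kappa might be sketched as follows. The class name is hypothetical, the statistic is recomputed from the window on every call for clarity rather than speed, and returning 0 when the chance accuracy is exactly 1 is an assumed convention.

```python
from collections import deque


class SlidingWindowKappa:
    """Kappa over the most recent `window_size` (true, predicted) pairs."""

    def __init__(self, window_size):
        # deque with maxlen drops the oldest pair automatically: the forgetting step.
        self.window = deque(maxlen=window_size)

    def add(self, true_label, predicted_label):
        self.window.append((true_label, predicted_label))

    def kappa(self):
        n = len(self.window)
        if n == 0:
            return 0.0
        # p0: fraction of correct predictions inside the window.
        p0 = sum(t == p for t, p in self.window) / n
        # pc: agreement expected from a chance predictor with the same marginals.
        labels = {t for t, _ in self.window} | {p for _, p in self.window}
        pc = 0.0
        for label in labels:
            true_frac = sum(t == label for t, _ in self.window) / n
            pred_frac = sum(p == label for _, p in self.window) / n
            pc += true_frac * pred_frac
        if pc == 1.0:  # degenerate window: kappa undefined (assumed convention)
            return 0.0
        return (p0 - pc) / (1.0 - pc)


# Feed the 100 observations of the simple confusion matrix example.
tracker = SlidingWindowKappa(window_size=100)
for true_label, predicted, count in [("+", "+", 75), ("+", "-", 8),
                                     ("-", "+", 7), ("-", "-", 10)]:
    for _ in range(count):
        tracker.add(true_label, predicted)
kappa = tracker.kappa()  # (0.85 - 0.7112) / (1 - 0.7112), roughly 0.48
```

With 85% accuracy the kappa is only about 0.48, reflecting how much of that accuracy a chance predictor would obtain on this class distribution.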

10 / 26

Page 12: Sentiment Knowledge Discovery in Twitter Streaming Data

Data Stream Mining Methods

Multinomial Naïve Bayes
  Considers a document as a bag-of-words.
  Estimates the probability of observing word w and the prior probability P(c)
  Probability of class c given a test document:

$$P(c \mid d) = \frac{P(c) \prod_{w \in d} P(w \mid c)^{n_{wd}}}{P(d)}$$
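An incremental version of this classifier can be sketched with plain counters, reading the exponent n_wd as the number of times word w occurs in document d. The class name, the Laplace smoothing parameter `alpha`, and the choice to ignore words never seen in training are assumptions, not details from the slide. Scores are kept in log space, and P(d) is dropped because it is the same for every class.

```python
import math
from collections import Counter, defaultdict


class StreamingMultinomialNB:
    """Multinomial Naive Bayes updated one document at a time (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing (assumed)
        self.class_counts = Counter()            # documents seen per class
        self.word_counts = defaultdict(Counter)  # word occurrences per class
        self.total_words = Counter()             # total word occurrences per class
        self.vocabulary = set()

    def learn(self, words, label):
        self.class_counts[label] += 1
        for w in words:
            self.word_counts[label][w] += 1
            self.total_words[label] += 1
            self.vocabulary.add(w)

    def log_scores(self, words):
        """log P(c) + sum over words of n_wd * log P(w|c), for every known class."""
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for c, n_c in self.class_counts.items():
            score = math.log(n_c / n_docs)
            denom = self.total_words[c] + self.alpha * vocab_size
            for w, n_wd in Counter(words).items():
                if w not in self.vocabulary:
                    continue  # unseen words carry no evidence (assumed policy)
                p_w_c = (self.word_counts[c][w] + self.alpha) / denom
                score += n_wd * math.log(p_w_c)
            scores[c] = score
        return scores

    def predict(self, words):
        scores = self.log_scores(words)
        return max(scores, key=scores.get) if scores else None


model = StreamingMultinomialNB()
model.learn(["good", "happy"], "+")
model.learn(["bad", "sad"], "-")
model.learn(["good", "day"], "+")
```

Only counts are stored, so each update costs time proportional to the tweet length, which suits the one-pass, limited-memory setting.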

11 / 26

Page 13: Sentiment Knowledge Discovery in Twitter Streaming Data

Data Stream Mining Methods

Stochastic Gradient Descent
  Vanilla stochastic gradient descent with a fixed learning rate
  Optimizing the hinge loss with an L2 penalty commonly applied to SVM
  Loss function to optimize:

$$\frac{\lambda}{2} \|\mathbf{w}\|^2 + \sum [1 - (y\mathbf{x}\mathbf{w} + b)]_+$$
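One way to sketch the per-example update is shown below. It assumes labels y in {+1, -1} and uses the standard hinge margin y(x·w + b); the class name, the sparse dictionary representation, and the default learning rate and λ are illustrative choices, not values from the talk.

```python
class HingeSGD:
    """Linear model trained by SGD on hinge loss with an L2 penalty (sketch)."""

    def __init__(self, learning_rate=0.1, lam=0.0001):
        self.learning_rate = learning_rate  # fixed learning rate
        self.lam = lam                      # regularization parameter lambda
        self.weights = {}                   # sparse weight vector: feature -> weight
        self.bias = 0.0

    def margin(self, features):
        return sum(self.weights.get(f, 0.0) * v for f, v in features.items()) + self.bias

    def predict(self, features):
        return 1 if self.margin(features) >= 0 else -1

    def learn(self, features, y):
        z = y * self.margin(features)
        # Gradient step on the L2 penalty: shrink every weight slightly.
        for f in self.weights:
            self.weights[f] *= 1.0 - self.learning_rate * self.lam
        # Gradient step on the hinge term, active only when the margin is below 1.
        if z < 1:
            for f, v in features.items():
                self.weights[f] = self.weights.get(f, 0.0) + self.learning_rate * y * v
            self.bias += self.learning_rate * y


model = HingeSGD()
for _ in range(20):  # a few passes over two toy examples, for illustration only
    model.learn({"good": 1.0}, 1)
    model.learn({"bad": 1.0}, -1)
```

Shrinking all weights on every example is simple but touches the whole weight vector; a production implementation would typically apply the shrinkage lazily.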

12 / 26

Page 14: Sentiment Knowledge Discovery in Twitter Streaming Data

Data Stream Mining Methods

Hoeffding Tree
  Incremental decision tree for data streams.
  Strategy based on the Hoeffding bound

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

A node is expanded by splitting as soon as there is sufficient statistical evidence
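The bound and the resulting split test can be sketched as follows, taking R as the range of the split criterion, δ as the allowed probability of error, and n as the number of examples seen at the node. The function names and the example gain values are hypothetical.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_best_gain, value_range, delta, n):
    """Split once the best attribute beats the runner-up by more than epsilon."""
    return best_gain - second_best_gain > hoeffding_bound(value_range, delta, n)


epsilon = hoeffding_bound(value_range=1.0, delta=1e-7, n=1000)  # about 0.09
```

Since ε shrinks as n grows, a node that cannot yet separate two close attributes simply waits for more examples before splitting.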

13 / 26

Page 15: Sentiment Knowledge Discovery in Twitter Streaming Data

Outline

1 Twitter Streaming Data

2 Twitter Sentiment Classification: Metrics and Methods

3 Empirical results

14 / 26

Page 16: Sentiment Knowledge Discovery in Twitter Streaming Data

What is MOA?

Massive Online Analysis (MOA) is a framework for mining data streams.

Based on experience with Weka and VFML
Focussed on classification trees, but lots of active development: clustering, item set and sequence mining, regression
Easy to extend
Easy to design and run experiments

15 / 26

Page 17: Sentiment Knowledge Discovery in Twitter Streaming Data

MOA: the bird

The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct.

16 / 26

Page 18: Sentiment Knowledge Discovery in Twitter Streaming Data

Twitter Sentiment Corpora

Twitter Sentiment Corpus
twittersentiment.appspot.com

Alec Go, Richa Bhayani, Karthik Raghunathan, and Lei Huang
Website to research the sentiment for a brand, product, or topic.
Training dataset with messages between April 2009 and June 25, 2009
  800,000 tweets with positive emoticons
  800,000 tweets with negative emoticons
Test dataset manually annotated
  177 negative tweets
  182 positive ones

17 / 26

Page 19: Sentiment Knowledge Discovery in Twitter Streaming Data

Twitter Sentiment Corpora

Edinburgh Corpus
http://demeter.inf.ed.ac.uk

Sasa Petrovic, Miles Osborne, and Victor Lavrenko
97 million tweets (14 GB)
Each tweet contains
  timestamp of the tweet
  anonymized user name
  the tweet’s text
  the posting method that was used
Collected between November 11th 2009 and February 1st 2010, using Twitter’s streaming API.

18 / 26

Page 20: Sentiment Knowledge Discovery in Twitter Streaming Data

Twitter Empirical Evaluation

Sliding Window Prequential Accuracy

[Plot: accuracy (%) against millions of instances (0.01 to 1.55); series: NB Multinomial, SGD, Hoeffding Tree, Class Distribution]

Figure: Accuracy and Kappa Statistic on twittersentiment corpus

19 / 26

Page 21: Sentiment Knowledge Discovery in Twitter Streaming Data

Twitter Empirical Evaluation

Sliding Window Kappa Statistic

[Plot: Kappa statistic against millions of instances (0.01 to 1.55); series: NB Multinomial, SGD, Hoeffding Tree, Class Distribution]

Figure: Accuracy and Kappa Statistic on twittersentiment corpus

19 / 26

Page 22: Sentiment Knowledge Discovery in Twitter Streaming Data

Twitter Empirical Evaluation

Sliding Window Prequential Accuracy

[Plot: accuracy (%) against millions of instances (0.01 to 2.08); series: NB Multinomial, SGD, Hoeffding Tree, Class Distribution]

Figure: Accuracy and Kappa Statistic on Edinburgh corpus

20 / 26

Page 23: Sentiment Knowledge Discovery in Twitter Streaming Data

Twitter Empirical Evaluation

Sliding Window Kappa Statistic

[Plot: Kappa statistic against millions of instances (0.01 to 2.08); series: NB Multinomial, SGD, Hoeffding Tree, Class Distribution]

Figure: Accuracy and Kappa Statistic on Edinburgh corpus

20 / 26

Page 24: Sentiment Knowledge Discovery in Twitter Streaming Data

twittersentiment Corpus

Prequential Accuracy and Kappa

                          Accuracy   Kappa    Time
Multinomial Naïve Bayes   75.05%     50.10%   116.62 sec.
SGD                       82.80%     62.60%   219.54 sec.
Hoeffding Tree            73.11%     46.23%   5525.51 sec.

Total prequential accuracy and Kappa measured on the twittersentiment data stream

21 / 26

Page 25: Sentiment Knowledge Discovery in Twitter Streaming Data

Edinburgh Corpus

Prequential Accuracy and Kappa

                          Accuracy   Kappa    Time
Multinomial Naïve Bayes   86.11%     36.15%   173.28 sec.
SGD                       86.26%     31.88%   293.98 sec.
Hoeffding Tree            84.76%     20.40%   6151.51 sec.

Total prequential accuracy and Kappa obtained on the Edinburgh corpus data stream.

22 / 26

Page 26: Sentiment Knowledge Discovery in Twitter Streaming Data

SGD coefficient variations on the Edinburgh corpus

Tags        Middle of Stream   End of Stream   Variation
            Coefficient        Coefficient
apple        0.3                0.7             0.4
microsoft   -0.4               -0.1             0.3
facebook    -0.3                0.4             0.7
mcdonalds    0.5                0.1            -0.4
google       0.3                0.6             0.3
disney       0.0                0.0             0.0
bmw          0.0               -0.2            -0.2
pepsi        0.1               -0.6            -0.7
dell         0.2                0.0            -0.2
gucci       -0.4                0.6             1.0
amazon      -0.1               -0.4            -0.3
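The variation column is the end-of-stream coefficient minus the mid-stream coefficient. A short sketch that recomputes it from the table values and picks out the tag whose coefficient moved most; the variable names are illustrative.

```python
# (middle-of-stream coefficient, end-of-stream coefficient) per tag, from the table.
coefficients = {
    "apple": (0.3, 0.7), "microsoft": (-0.4, -0.1), "facebook": (-0.3, 0.4),
    "mcdonalds": (0.5, 0.1), "google": (0.3, 0.6), "disney": (0.0, 0.0),
    "bmw": (0.0, -0.2), "pepsi": (0.1, -0.6), "dell": (0.2, 0.0),
    "gucci": (-0.4, 0.6), "amazon": (-0.1, -0.4),
}

# Round to one decimal to match the table and hide floating-point noise.
variation = {tag: round(end - mid, 1) for tag, (mid, end) in coefficients.items()}

# Tag with the largest absolute change between the two snapshots.
largest_shift = max(variation, key=lambda tag: abs(variation[tag]))
```

Comparing snapshots of a linear model's weights in this way is what makes the SGD model convenient for tracking how a term's contribution changes over the stream.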

23 / 26

Page 27: Sentiment Knowledge Discovery in Twitter Streaming Data

Summary

Twitter is a new “what’s-happening-right-now” tool

Twitter as a stream mining dataset for real-time predictions
Sliding window Kappa statistic
Recommend SGD-based model

24 / 26

Page 28: Sentiment Knowledge Discovery in Twitter Streaming Data

twittersentiment Corpus

Hold-out Accuracy and Kappa

                          Accuracy   Kappa
Multinomial Naïve Bayes   82.45%     64.89%
SGD                       78.55%     57.23%
Hoeffding Tree            69.36%     38.73%

Accuracy and Kappa for the test dataset obtained from twittersentiment

25 / 26

Page 29: Sentiment Knowledge Discovery in Twitter Streaming Data

Edinburgh Corpus

Hold-out Accuracy and Kappa

                          Accuracy   Kappa
Multinomial Naïve Bayes   73.81%     47.28%
SGD                       67.41%     34.23%
Hoeffding Tree            60.72%     20.59%

Accuracy and Kappa for the test dataset obtained from twittersentiment using the Edinburgh corpus as training data stream.

26 / 26