Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Mining Twitter Data with Resource Constraints

George Valkanas, Ioannis Katakis, Dimitrios Gunopulos, Anthony Stefanidis (UoA and GMU)

August 12, 2015


DESCRIPTION

Social media analysis constitutes a scientific field that is rapidly gaining ground due to its numerous research challenges and practical applications, as well as the unprecedented availability of data in real time. Several of these applications, such as journalism, crisis management, and advertising, have significant social and economic impact. However, two issues regarding these applications have to be confronted. The first one is the financial cost. Despite the abundance of information, it typically comes at a premium price, and only a fraction is provided free of charge. For example, Twitter, a predominant social media online service, grants researchers and practitioners free access to only a small proportion (1%) of its publicly available stream. The second issue is the computational cost. Even when the full stream is available, off-the-shelf approaches are unable to operate in such settings due to the real-time computational demands. Consequently, real-world applications as well as research efforts that exploit such information are limited to utilizing only a subset of the available data. In this paper, we are interested in evaluating the extent to which analytical processes are affected by the aforementioned limitation. In particular, we apply a range of analysis processes to two subsets of Twitter public data, obtained through the service’s sampling APIs. The first one is the default 1% sample, whereas the second is the Gardenhose sample that our research group has access to, returning 10% of all public data. We extensively evaluate their relative performance in numerous scenarios.

TRANSCRIPT

Page 1: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Mining Twitter Data with Resource Constraints

George Valkanas, Ioannis Katakis, Dimitrios Gunopulos, Anthony Stefanidis

August 12, 2015

Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 1 / 18

Page 2: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Research Question

Is the 1% sample provided by the Twitter API sufficient for spatio-temporal analysis tasks?

... which tasks?

→ We compare with the 10% sample (Garden Hose)


Page 3: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Outline

1 Problem and Motivation

2 Data Collection

3 Experiments in Various Tasks

Geo-location Coverage

Sentiment Analysis

Popular Topic Detection

Graph Evolution

4 Conclusions


Page 4: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Introduction

Twitter Samples

Two ways to access the stream

Public Stream: 1% Sample

Garden Hose: 10% Sample

... in both cases, we don’t know details about the sampling method.


Page 5: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Introduction

Constraints

Financial cost

Licenses for larger samples are costly and difficult to obtain.

Computational cost

7 gigabytes per minute

Off-the-shelf approaches are unable to operate in such settings

In practice: those who engage in social media analytical tasks have practically no choice but to resort to the downsized information. However, being only a small fraction of the entire stream, it is unclear how reliable this information is for each type of application.


Page 6: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Introduction

A more concrete example

The INSIGHT Project: Improve understanding, prediction and warning of emergencies through real-time processing of data streams including social data.

Figure: (a) Floods in Germany (2013); (b) Control Center in Dublin CC

How much data is sufficient for our task?


Page 7: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Introduction

Tasks we look into...

Sentiment Analysis

Geo-located information

Popular tweets

Social Graph Evolution

Linguistic Analysis


Page 8: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Data

The data

Figure: Comparing default and gardenhose samples for volume over time. Panels: (c) All tweets, tweet count vs. hours; (d) GPS-tagged tweets, GPS tweet count vs. hours. Series in both panels: Default, Gardenhose.

4-day period, November 2013

The two samples differ by an order of magnitude

Both exhibit the same temporal pattern

Geotagged tweets make up 1-2% of their respective samples

The geotagged curves are flatter over time
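The volume comparison on this slide can be approximated with a few lines of code. This is a minimal sketch, not the authors' pipeline: it assumes each tweet is a dict with a `created_at` datetime and an optional `coordinates` entry (hypothetical field names), and the helpers `hourly_volume` and `geotagged_share` are introduced here purely for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_volume(tweets, start):
    """Count all tweets and geotagged tweets per hour elapsed since `start`.

    `tweets` is an iterable of dicts with a `created_at` datetime and an
    optional `coordinates` entry (assumed field names, not a real API schema).
    """
    total, geo = Counter(), Counter()
    for tweet in tweets:
        hour = int((tweet["created_at"] - start).total_seconds() // 3600)
        total[hour] += 1
        if tweet.get("coordinates") is not None:
            geo[hour] += 1
    return total, geo


def geotagged_share(total, geo):
    """Fraction of geotagged tweets for every hour that contains tweets."""
    return {hour: geo[hour] / count for hour, count in total.items()}


# Toy data: 100 tweets in hour 0 (2 geotagged), 50 in hour 1 (1 geotagged).
# The start date is arbitrary and only anchors the hour buckets.
start = datetime(2013, 11, 1)
plain_0 = {"created_at": start + timedelta(minutes=5), "coordinates": None}
geo_0 = {"created_at": start + timedelta(minutes=10), "coordinates": (51.5, -0.1)}
plain_1 = {"created_at": start + timedelta(hours=1, minutes=1), "coordinates": None}
geo_1 = {"created_at": start + timedelta(hours=1, minutes=2), "coordinates": (51.5, -0.1)}
sample = [plain_0] * 98 + [geo_0] * 2 + [plain_1] * 49 + [geo_1]

total, geo = hourly_volume(sample, start)
share = geotagged_share(total, geo)
```

Running the same two functions on the default and the Gardenhose sample gives the per-hour series that the figure plots, and `share` gives the geotagged fraction per hour.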


Page 9: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Experiments

Geo-location coverage - Experiment 1

Bounding Box

Twitter also allows its users to ask for geotagged information.

The user provides a bounding box, by specifying 4 coordinates in the form [(lat_min, lon_min), (lat_max, lon_max)], and Twitter returns tweets that fall within this region.

Figure: Map plot with axes lon (60 to 150) and lat (−50 to 25).

In this particular case, where geotagged tweets are requested instead of a general sample, the volume of the returned results is the same for the two samples!
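The bounding-box filter described on this slide runs on Twitter's side, but the same test is easy to express locally. The sketch below is an illustration under stated assumptions: `BoundingBox`, `filter_geotagged`, and the `coordinates` field holding a `(lat, lon)` pair are hypothetical names, the coordinate order follows the slide, and boxes that cross the antimeridian are not handled.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """Axis-aligned box in the slide's order: (lat_min, lon_min), (lat_max, lon_max)."""

    lat_min: float
    lon_min: float
    lat_max: float
    lon_max: float

    def contains(self, lat: float, lon: float) -> bool:
        """True if the point lies inside the box (borders included)."""
        return (
            self.lat_min <= lat <= self.lat_max
            and self.lon_min <= lon <= self.lon_max
        )


def filter_geotagged(tweets, box):
    """Keep tweets whose (lat, lon) pair falls inside `box`.

    Tweets without a `coordinates` entry (an assumed field name) are dropped.
    """
    return [
        tweet
        for tweet in tweets
        if tweet.get("coordinates") is not None
        and box.contains(*tweet["coordinates"])
    ]


# Rough, illustrative box around London; the exact corners are made up.
london = BoundingBox(lat_min=51.28, lon_min=-0.51, lat_max=51.69, lon_max=0.33)
tweets = [
    {"id": 1, "coordinates": (51.50, -0.12)},  # inside the box
    {"id": 2, "coordinates": (48.85, 2.35)},   # outside the box
    {"id": 3, "coordinates": None},            # not geotagged
]
inside = filter_geotagged(tweets, london)
```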


Page 10: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Experiments

Geo-location coverage - Experiment 2

4 different crawls in London area

Figure: Tweet count per half-hour interval for the four crawls (series: Loc1, Loc2, Loc3, Loc4).

As the overlap between the bounding boxes increases, so does the similarity between two different crawls.
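The slide does not say how overlap or crawl similarity was measured. One plausible way to quantify both is the Jaccard index: intersection over union of the box areas, and of the sets of tweet ids each crawl returned. The sketch below makes that assumption explicit; `box_overlap` and `crawl_similarity` are hypothetical helpers, and areas are computed in plain degree space without any map projection.

```python
def box_area(box):
    """Area of a (lat_min, lon_min, lat_max, lon_max) box in squared degrees.

    Returns 0.0 for an empty (inverted) box.
    """
    lat_min, lon_min, lat_max, lon_max = box
    return max(0.0, lat_max - lat_min) * max(0.0, lon_max - lon_min)


def box_overlap(a, b):
    """Jaccard overlap of two boxes: intersection area over union area."""
    intersection = (
        max(a[0], b[0]),
        max(a[1], b[1]),
        min(a[2], b[2]),
        min(a[3], b[3]),
    )
    inter_area = box_area(intersection)
    union_area = box_area(a) + box_area(b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0


def crawl_similarity(ids_a, ids_b):
    """Jaccard similarity of the tweet id sets returned by two crawls."""
    ids_a, ids_b = set(ids_a), set(ids_b)
    union = ids_a | ids_b
    return len(ids_a & ids_b) / len(union) if union else 0.0


# Toy boxes: two partially overlapping squares and one disjoint square.
box_1 = (0.0, 0.0, 2.0, 2.0)
box_2 = (1.0, 1.0, 3.0, 3.0)
box_3 = (5.0, 5.0, 6.0, 6.0)
partial = box_overlap(box_1, box_2)   # intersection 1, union 7
disjoint = box_overlap(box_1, box_3)  # no intersection
similarity = crawl_similarity([1, 2, 3, 4], [3, 4, 5, 6])
```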


Page 11: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Experiments

Sentiment Analysis

Figure: Positive and Negative Sentiment Ratio. Three panels of ratio vs. hour; the first two compare Sample 1% with Sample 10%, and the third plots Pos 1%, Neg 1%, Pos 10%, and Neg 10%.

- Dictionary-based sentiment analysis

- The ratio of sentiment-bearing tweets is the same in both samples

- Ratios in geotagged tweets are lower, meaning that geotagged tweets offer less sentiment-oriented information
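A dictionary-based sentiment ratio of the kind named on this slide can be sketched as follows. The slides do not name the lexicon or the scoring rule, so both are assumptions here: the tiny `POSITIVE` and `NEGATIVE` word sets are toy stand-ins for a real lexicon, and a tweet is labeled by whichever set matches more of its words.

```python
import re

# Toy lexicons; a real study would load a published sentiment dictionary.
POSITIVE = {"good", "great", "happy", "love"}
NEGATIVE = {"bad", "sad", "hate", "awful"}


def classify(text):
    """Label a tweet by comparing positive and negative word counts."""
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(word in POSITIVE for word in words)
    negative_hits = sum(word in NEGATIVE for word in words)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"


def sentiment_ratios(texts):
    """Share of positive and of negative tweets among all tweets."""
    labels = [classify(text) for text in texts]
    if not labels:
        return {"positive": 0.0, "negative": 0.0}
    return {
        "positive": labels.count("positive") / len(labels),
        "negative": labels.count("negative") / len(labels),
    }


texts = ["I love this", "so sad and awful", "meeting at noon", "great day"]
ratios = sentiment_ratios(texts)
```

Computing `sentiment_ratios` per hour for each sample yields the kind of series shown in the figure.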


Page 12: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Experiments

Popular Topic Detection - Experiment

1 Extract the top-k most retweeted posts that appear in our data (both samples).

2 Compare the two lists (Kendall correlation)

3 Compare the two lists with the ground truth (= actual retweet count information included in the tweet)
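The three steps above can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: popularity within a sample is taken to be the number of retweets of a post observed in that sample, the correlation is Kendall's tau-a computed only over items present in both lists, and `top_k_retweeted`, `kendall_tau`, and `common_fraction` are hypothetical helper names. How the paper treats items missing from one list is not stated on the slides.

```python
from collections import Counter
from itertools import combinations


def top_k_retweeted(retweeted_ids, k):
    """Ids of the k posts retweeted most often in a sample, best first.

    `retweeted_ids` holds one original-post id per observed retweet.
    """
    return [post_id for post_id, _ in Counter(retweeted_ids).most_common(k)]


def kendall_tau(list_a, list_b):
    """Kendall tau-a over the items that appear in both ranked lists."""
    rank_b = {item: position for position, item in enumerate(list_b)}
    common = [item for item in list_a if item in rank_b]
    pairs = list(combinations(common, 2))
    if not pairs:
        return 0.0
    # `x` precedes `y` in list_a, so the pair is concordant when the same
    # order holds in list_b.
    concordant = sum(rank_b[x] < rank_b[y] for x, y in pairs)
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


def common_fraction(list_a, list_b):
    """Fraction of list_a's items that also appear in list_b."""
    return len(set(list_a) & set(list_b)) / len(list_a) if list_a else 0.0


# Toy retweet observations for a small and a ten times larger sample.
small_sample = ["a", "a", "a", "b", "b", "c"]
large_sample = ["a"] * 30 + ["c"] * 20 + ["b"] * 10

top_small = top_k_retweeted(small_sample, 3)  # ranks a, b, c
top_large = top_k_retweeted(large_sample, 3)  # ranks a, c, b
tau = kendall_tau(top_small, top_large)
shared = common_fraction(top_small, top_large)
```

For step 3, the same `kendall_tau` call can compare either list against a ranking built from the retweet count carried inside each tweet.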


Page 13: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Experiments

Popular Topic Detection - Results

Figure: Comparing the top-N most retweeted items. Panels: (a) Kendall, Kendall correlation vs. list items; (b) Common Items, common items (%) vs. list items; (c) Vs the ground truth, Kendall correlation vs. iteration for Sample 1% and Sample 10%. Series in (a) and (b): S1-S10, S1-S10P1, S1-S10P2, S10P1-S10P2, S1-S1P1.

Conclusions

For up to 10 items, 1% is adequate. That is not, however, the case for lists with more than 1000 items.

Comparison with ground truth: 10% has higher correlation.

Page 14: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Experiments

Graph Evolution Study - Experiment

Study the re-tweet graph (directed)

Edges are weighted (more re-tweets → larger weight) and decay over time

Edges are removed when their weight drops below a certain threshold

Method 1 (Iter): at each time interval, extract a new graph

Method 2 (Glb): at each time interval, aggregate the new nodes into the current graph
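The decaying retweet graph and the two extraction methods can be sketched as below. The slides give neither the decay function nor the threshold, so this example assumes multiplicative decay per interval with made-up values for `decay` and `threshold`; the edge direction (retweeter to original author) and the class name `DecayingRetweetGraph` are also assumptions.

```python
class DecayingRetweetGraph:
    """Directed, weighted retweet graph whose edge weights decay over time."""

    def __init__(self, decay=0.5, threshold=0.2):
        self.decay = decay          # assumed multiplicative decay per interval
        self.threshold = threshold  # assumed pruning threshold
        self.weights = {}           # (retweeter, author) -> weight

    def step(self, retweets):
        """Advance one time interval.

        Existing weights decay first, each new (retweeter, author) pair adds
        1.0 to its edge, then edges below the threshold are removed.
        """
        for edge in self.weights:
            self.weights[edge] *= self.decay
        for edge in retweets:
            self.weights[edge] = self.weights.get(edge, 0.0) + 1.0
        self.weights = {
            edge: weight
            for edge, weight in self.weights.items()
            if weight >= self.threshold
        }

    def nodes(self):
        """All users that still appear on at least one edge."""
        return {user for edge in self.weights for user in edge}


def iter_graph(retweets, **params):
    """Method 1 (Iter): build a fresh graph from a single interval."""
    graph = DecayingRetweetGraph(**params)
    graph.step(retweets)
    return graph


# Method 2 (Glb): keep one graph and feed it every interval in turn.
intervals = [
    [("u1", "u2"), ("u1", "u2"), ("u3", "u2")],
    [],
    [],
    [],
]
glb = DecayingRetweetGraph(decay=0.5, threshold=0.2)
for interval in intervals:
    glb.step(interval)

fresh = iter_graph(intervals[0], decay=0.5, threshold=0.2)
```

In this toy run the edge retweeted twice survives three quiet intervals in the Glb graph, while the edge retweeted once decays below the threshold and is pruned.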


Page 15: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Experiments

Results

Figure: Statistical properties of the extracted retweet graph, over time. Panels (value vs. iteration): (a) Size, for Iter 1%, Glb 1%, Iter 10%, Glb 10%; (b) Lar. Con. Comp. Size, for Glb 1% and Glb 10%; (c) Clustering Coefficient, for Iter 1%, Glb 1%, Iter 10%, Glb 10%.

Conclusions

No significant differences between the two samples

LCC does not follow the 24-hour pattern

Clustering coefficient of 10% is similar to 100%

Page 16: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Experiments

More on the paper...

Retweet Burstiness

The rate at which users retweet information plays an important role in capturing trending topics

We investigate whether there is a difference between the rates of receiving retweets in both samples

Linguistic Analysis

Is there a correlation between the spoken languages in Twitter and the ground truth obtained from studies in the physical world?

What are the differences between the two samples in this context?

We use language detection tools and ground truth information from Wikipedia.


Page 17: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Summary and Conclusions

Conclusions

Research question: Is the default sample sufficient? For which tasks?

Focused on spatio-temporal tasks

We compared 1% with 10% sample

The samples have quite similar properties

However, when you get into the details (less popular re-tweets), the bigger sample is better


Page 18: Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014

Summary and Conclusions

The End...

Thank You!

Contact: @iokat // [email protected] // www.katakis.eu

Acknowledgement

This work has been co-financed by EU and Greek National funds through the Operational Program “Education and Lifelong Learning” of the National Strategic Reference Framework (NSRF) - Research Funding Programs: Heraclitus II fellowship, THALIS - GeomComp, THALIS - DISFER, ARISTEIA - MMD and the EU funded project INSIGHT.
