# RSC: Mining and Modeling Temporal Activity in Social Media

Post on 17-Mar-2018



Alceu F. Costa*, Yuto Yamaguchi, Agma J. M. Traina, Caetano Traina Jr., Christos Faloutsos

Universidade de São Paulo

KDD 2015, Sydney, Australia

*alceufc@icmc.usp.br

Introduction

Users generate sequences of time-stamps when they use a social media Web site.

What can we learn from time-stamps?

Are there common patterns?

Can we tell if a user is a bot or a human?

[Figure: sequence of tweet time-stamps; bars are the tweets' time-stamps]

Outline

Pattern Mining: What patterns can we discover from the temporal activities of social media users?

Modeling

Bot Detection

Experiments

Conclusion

Pattern Mining: Datasets

Reddit dataset: time-stamps from comments; 21,198 users; 20 million time-stamps

Twitter dataset: time-stamps from tweets; 6,790 users; 16 million time-stamps

For each user we have:

Sequence of posting time-stamps: $T = (t_1, t_2, t_3, \ldots)$

Inter-arrival times (IAT) of postings: $(\Delta_1, \Delta_2, \Delta_3, \ldots)$

[Figure: time axis with time-stamps $t_1, t_2, t_3, t_4$ and the IATs $\Delta_1, \Delta_2, \Delta_3$ between consecutive time-stamps]
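As a minimal sketch of the definition above, the IATs of one user can be computed as the gaps between consecutive time-stamps. The function name `inter_arrival_times` and the example values are illustrative and do not come from the slides; time-stamps are assumed to be numbers in seconds.

```python
from typing import List


def inter_arrival_times(timestamps: List[float]) -> List[float]:
    """Return the gaps between consecutive time-stamps (same unit as the input)."""
    ordered = sorted(timestamps)  # tolerate unsorted input
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Four postings give three IATs, as in the timeline figure.
posts = [0.0, 60.0, 150.0, 10950.0]
print(inter_arrival_times(posts))  # [60.0, 90.0, 10800.0]
```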

Pattern Mining

Pattern 1: Distribution of IAT is heavy-tailed

Users can be inactive for long periods of time before making new postings.

[Figure: IAT complementary cumulative distribution function (CCDF) on log-log axes; panels: Reddit users, Twitter users]
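One way to obtain the points of such a CCDF plot is sketched below. It is an illustrative helper (the name `empirical_ccdf` is not from the slides) that returns, for each distinct IAT value x, the fraction of IATs strictly greater than x.

```python
from typing import List, Tuple


def empirical_ccdf(iats: List[float]) -> List[Tuple[float, float]]:
    """Return (x, P(IAT > x)) pairs for each distinct IAT value, in ascending order."""
    ordered = sorted(iats)
    n = len(ordered)
    points = []
    for idx, value in enumerate(ordered):
        # Emit one point per distinct value, at its last occurrence.
        if idx + 1 == n or ordered[idx + 1] != value:
            points.append((value, (n - idx - 1) / n))
    return points


print(empirical_ccdf([1.0, 1.0, 2.0, 4.0]))  # [(1.0, 0.5), (2.0, 0.25), (4.0, 0.0)]
```

On log-log axes the last point (probability 0) has to be dropped before plotting.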

Pattern Mining

Pattern 2: Bimodal IAT distribution

Users have highly active sessions and resting periods.

[Figure: log-binned histogram of postings IAT for Twitter users; x-axis: Δ, IAT (seconds), log scale; y-axis: PDF; 1st mode at about 1 min, 2nd mode at about 3 h]
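A log-binned histogram can be sketched as follows. The bin layout (a fixed number of bins per decade between 10^0 and 10^7 seconds) and the function name are assumptions chosen for illustration; the slides do not state the binning used.

```python
import math
from typing import List


def log_binned_histogram(iats: List[float], bins_per_decade: int = 5,
                         min_exp: int = 0, max_exp: int = 7) -> List[float]:
    """Fraction of IATs that fall in each logarithmically spaced bin."""
    n_bins = (max_exp - min_exp) * bins_per_decade
    counts = [0] * n_bins
    for iat in iats:
        if iat <= 0:
            continue  # the logarithm is undefined for non-positive gaps
        idx = int(math.floor((math.log10(iat) - min_exp) * bins_per_decade))
        idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values to the edge bins
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


# With one bin per decade, 50 s and 60 s share a bin and 3 h lands three bins later.
print(log_binned_histogram([60.0, 60.0, 50.0, 10800.0], bins_per_decade=1))
```

A user with both short bursts and multi-hour breaks shows mass in two separated groups of bins, which is the bimodal shape described above.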

Pattern Mining

Pattern 3: Periodic spikes in the IAT distribution

Caused by daily sleeping intervals.

[Figure: log-binned histogram of postings IAT for Reddit users; x-axis: Δ, IAT (seconds), log scale; y-axis: PDF; marked IAT values: 7h, 12h, 24h, 48h, 72h]

Pattern Mining

Pattern 4: Consecutive IAT are correlated

Long/short IAT are likely to be followed by long/short IAT.

[Figure: heat-map of pairs of consecutive IAT for all Reddit users; the pairs concentrate along the diagonal, which indicates positive correlation]
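A simple numeric counterpart to the heat-map is the correlation between each IAT and the one before it. The sketch below uses the Pearson correlation of log-IATs; taking logarithms is an assumption made here because IATs span several orders of magnitude, and the function name is illustrative.

```python
import math
from typing import List


def consecutive_iat_correlation(iats: List[float]) -> float:
    """Pearson correlation between log(IAT_{i-1}) and log(IAT_i)."""
    logs = [math.log(x) for x in iats if x > 0]
    prev, curr = logs[:-1], logs[1:]
    n = len(prev)
    if n < 2:
        raise ValueError("need at least three positive IATs")
    mean_p, mean_c = sum(prev) / n, sum(curr) / n
    cov = sum((p - mean_p) * (c - mean_c) for p, c in zip(prev, curr))
    var_p = sum((p - mean_p) ** 2 for p in prev)
    var_c = sum((c - mean_c) ** 2 for c in curr)
    # Raises ZeroDivisionError if either series is constant.
    return cov / math.sqrt(var_p * var_c)


# Steadily growing gaps are perfectly correlated; alternating gaps are anti-correlated.
print(consecutive_iat_correlation([1, 2, 4, 8, 16, 32]))
print(consecutive_iat_correlation([1, 100, 1, 100, 1, 100]))
```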

Outline

Pattern Mining

Modeling: Can we model the patterns?

Bot Detection

Experiments

Conclusion

RSC Model

Can we generate synthetic time-stamps that match real data patterns?

Models compared: Poisson Process; Queue Based (Barabási, 2005); CNPP (Malmgren et al., 2009); SFP (Vaz de Melo et al., 2013); RSC (proposed model)

Patterns checked for each model: Heavy Tails, Bimodal Distribution, Periodic Spikes, IAT Correlation

Proposed Model: Rest-Sleep-and-Comment

RSC Model

Base model: Self-Correlated Process (SCorr)

Definition: A stochastic process is a SCorr process with base rate $\lambda$ and correlation $\rho$ if:

$$\Delta_i \sim \mathrm{Exp}(\rho\,\Delta_{i-1} + 1/\lambda)$$

where $X \sim \mathrm{Exp}(1/\lambda)$ denotes an exponential random variable with rate $\lambda$.

Consecutive IAT are correlated: the i-th IAT $\Delta_i$ depends on the previous, (i-1)-th IAT $\Delta_{i-1}$.

$\rho$ controls correlation strength: if $\rho = 0$, SCorr reduces to an exponential distribution.
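The SCorr definition can be sketched as a small generator. This is an illustration under stated assumptions, not the authors' implementation: the base rate is called `lam` and the correlation `rho`; the argument of Exp(.) is read as the mean of the exponential variable (so Exp(1/lam) has rate `lam`); and the first IAT is drawn as if the previous IAT were zero, which the slide does not specify.

```python
import random
from typing import List, Optional


def scorr_iats(lam: float, rho: float, n: int,
               rng: Optional[random.Random] = None) -> List[float]:
    """Draw n inter-arrival times from a SCorr process with base rate lam and correlation rho."""
    rng = rng or random.Random()
    base_mean = 1.0 / lam  # mean of an exponential variable with rate lam
    iats: List[float] = []
    prev = 0.0  # assumption: no previous IAT before the first draw
    for _ in range(n):
        mean = rho * prev + base_mean
        # random.expovariate expects a rate, so pass 1 / mean.
        prev = rng.expovariate(1.0 / mean)
        iats.append(prev)
    return iats


# With rho = 0 every draw is a plain exponential with mean 1 / lam.
sample = scorr_iats(lam=2.0, rho=0.0, n=20000, rng=random.Random(0))
print(sum(sample) / len(sample))  # close to 0.5
```

A larger `rho` makes a long IAT raise the mean of the next one, which is what produces runs of long or short gaps.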

RSC Model

SCorr Process

Pattern checklist: Correlated IAT, Heavy Tail, Bimodal Distribution, Periodic Spikes

[Figure: consecutive IAT distribution for SCorr (synthetic data), with $\lambda$ = 20h, $\rho$ = 0.7]

RSC Model

SCorr Process

Pattern checklist: Correlated IAT, Heavy Tail, Bimodal Distribution, Periodic Spikes

[Figure: IAT CCDF, Reddit data vs. SCorr, with $\lambda$ = 20h, $\rho$ = 0.7]

RSC Model

SCorr Process

Pattern checklist: Correlated IAT, Heavy Tail, Bimodal Distribution, Periodic Spikes

[Figure: IAT log-binned histogram, data vs. SCorr, with $\lambda$ = 20m, $\rho$ = 1.0]

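RSC Model

State Transitions

Model states

Active:
1. Wait $\Delta \sim \mathrm{SCorr}(\lambda_A, \rho_A)$
2. Post with probability $p_{post}$
3. Transition

Rest:
1. Wait $\Delta \sim \mathrm{SCorr}(\lambda_R, \rho_R)$
2. Transition

Base rates: $\lambda_A > \lambda_R$. The average wait time for the active state is smaller than for the rest state.

[Figure: state-transition diagram between the Active and Rest states, with edge labels $1-p_R$, $p_R$, $1-p_A$, $p_A$]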
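The active and rest states can be sketched as a small simulation. Several choices below are assumptions, since the extracted slide does not show which arrow each probability labels: `p_a` is taken as the probability of staying in the active state, `p_r` as the probability of staying in the rest state, each state remembers its own previous wait for the SCorr draw, and the simulation starts in the active state at time 0. The function name is illustrative.

```python
import random
from typing import List


def active_rest_timestamps(n_posts: int, lam_a: float, rho_a: float,
                           lam_r: float, rho_r: float,
                           p_post: float, p_a: float, p_r: float,
                           seed: int = 0) -> List[float]:
    """Generate n_posts posting time-stamps from a two-state (active/rest) model."""
    rng = random.Random(seed)
    params = {"active": (lam_a, rho_a), "rest": (lam_r, rho_r)}
    prev_wait = {"active": 0.0, "rest": 0.0}  # assumption: per-state SCorr memory
    state, now = "active", 0.0
    posts: List[float] = []
    while len(posts) < n_posts:  # requires p_post > 0 and p_r < 1 to terminate
        lam, rho = params[state]
        mean = rho * prev_wait[state] + 1.0 / lam  # SCorr wait, Exp(.) read as a mean
        wait = rng.expovariate(1.0 / mean)
        prev_wait[state] = wait
        now += wait
        if state == "active":
            if rng.random() < p_post:
                posts.append(now)
            # assumption: p_a is the probability of staying active
            state = "active" if rng.random() < p_a else "rest"
        else:
            # assumption: p_r is the probability of staying at rest
            state = "rest" if rng.random() < p_r else "active"
    return posts


# Active waits of about a minute, rest waits of about three hours (rates in 1/seconds).
ts = active_rest_timestamps(200, lam_a=1 / 60, rho_a=0.5, lam_r=1 / 10800, rho_r=0.5,
                            p_post=0.8, p_a=0.7, p_r=0.6, seed=1)
print(len(ts), ts[0] <= ts[-1])
```

Because the two states have very different base rates, the resulting IATs mix short and long gaps, which is the mechanism the slide uses to obtain a bimodal distribution.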

RSC Model

SCorr + Rest and Active States

Pattern checklist: Heavy Tail, Correlated IAT, Bimodal Distribution, Periodic Spikes

[Figure: IAT log-binned histogram, data vs. synthetic]

RSC Model

Modeling periodic spikes: sleep state

Keep track of the current time: a $t_{clock}$ variable, with 0:00h < $t_{clock}$ < 23:59h. Update $t_{clock}$ after each wait time.

Enter the sleep state if: current state = rest and ($t_{clock} < t_{wake}$ or $t_{clock} > t_{sleep}$)

In the sleep state:
1. Wait until the next wake-up time, $t_{wake}$
2. Transition to rest state

[Figure: 24-hour clock showing $t_{sleep}$, $t_{wake}$ and $t_{clock}$, divided into Sleep and Awake intervals]
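The clock bookkeeping for the sleep state can be sketched with three small helpers. The helper names are illustrative, times are assumed to be seconds since midnight, and the sleep condition assumes the wake-up time comes earlier in the day than the sleep time (for example, wake at 7:00 and sleep at 23:00).

```python
DAY = 24 * 3600  # seconds in a day


def is_sleep_time(t_clock: float, t_wake: float, t_sleep: float) -> bool:
    """Condition from the slide: t_clock < t_wake or t_clock > t_sleep."""
    return t_clock < t_wake or t_clock > t_sleep


def seconds_until_wake(t_clock: float, t_wake: float) -> float:
    """Time to wait from t_clock until the next occurrence of t_wake."""
    return (t_wake - t_clock) % DAY


def advance_clock(t_clock: float, wait: float) -> float:
    """Update the time-of-day clock after a wait, wrapping around midnight."""
    return (t_clock + wait) % DAY


t_wake, t_sleep = 7 * 3600, 23 * 3600
print(is_sleep_time(2 * 3600, t_wake, t_sleep))           # True
print(seconds_until_wake(23.5 * 3600, t_wake) / 3600)     # 7.5
print(advance_clock(23 * 3600, 2 * 3600) / 3600)          # 1.0
```

Since every sleep ends at the same time of day, the waits it produces cluster around fixed durations, which is how the sleep state can give rise to periodic spikes in the IAT histogram.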

RSC Model

Complete RSC Model

Pattern checklist: Heavy Tail, Correlated IAT, Bimodal Distribution, Periodic Spikes

Parameter estimation uses the Levenberg-Marquardt algorithm.

[Figure: IAT log-binned histogram]

Outline

Pattern Mining

Modeling

Bot Detection: Can we spot automated behavior based only on time-stamp data?

Experiments

Conclusion

Bot Detection

Problem: Given labeled time-stamp data from a set of users $\{U_1, U_2, U_3, \ldots\}$, decide if an unknown user $U_i$ is a human or a bot.

Solution: RSC-Spotter

Compare the user's IAT to synthetic IAT generated by the RSC model.

If the user's IAT are not similar to the RSC-generated IAT, then the user is likely to be a bot.

[Figure: sequence of time-stamps from a single user; x-axis: time (days), 0 to 70. Is the user that produced the time-stamps a human or a bot?]

RSC-Spotter

Comparing Time-stamps

Estimate the RSC parameters using the time-stamps from all users.

For each user:

1. Compute the IAT histogram, using logarithmically sized bins
2. Generate synthetic time-stamps using RSC (RSC can generate the same number of time-stamps as the user)
3. Compare the user and synthetic IAT histograms:

$$D = \sum_i |c_i - \hat{c}_i|$$

where $c_i$ are the IAT bin counts of the user data, $\hat{c}_i$ are the IAT bin counts of the synthetic data, and $D$ is the dissimilarity.

Cost-sensitive classification is used to decide if a user is a bot given the dissimilarity $D$.
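The comparison step can be sketched as below. The dissimilarity is read here as the sum of absolute per-bin differences between the user's and the synthetic IAT bin counts, which is an interpretation of the garbled slide formula. The threshold rule in `looks_like_bot` is a hypothetical stand-in for the cost-sensitive classifier, which the slides do not specify.

```python
from typing import Sequence


def dissimilarity(user_counts: Sequence[float], synth_counts: Sequence[float]) -> float:
    """Sum of absolute differences between aligned histogram bin counts."""
    if len(user_counts) != len(synth_counts):
        raise ValueError("histograms must use the same bins")
    return sum(abs(c - c_hat) for c, c_hat in zip(user_counts, synth_counts))


def looks_like_bot(d: float, threshold: float) -> bool:
    """Hypothetical decision rule: flag users whose dissimilarity exceeds a threshold."""
    return d > threshold


user_hist = [3, 0, 1]
synthetic_hist = [1, 1, 2]
d = dissimilarity(user_hist, synthetic_hist)
print(d, looks_like_bot(d, threshold=3))  # 4 True
```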

Outline

Pattern Mining

Modeling

Bot Detection

Experiments: Can RSC match real data? How well can RSC-Spotter detect bots?

Conclusion

Experiments: Can RSC Match Real Data?

Patterns compared across CNPP, SFP and RSC: Heavy Tail, Bimodal, Spikes, IAT Correlation

Models: RSC (proposed model); CNPP (Malmgren et al.); SFP (Vaz de Melo et al.)

[Figure: panels for Reddit users and Twitter users]

CNPP fails to match the heavy tail.

Experiments: Can RSC Match Real Data?

[Figure: Reddit users, CNPP (Malmgren et al.) fit]

CNPP: two modes, no periodic spikes.

Experiments: Can RSC Match Real Data?

[Figure: Reddit users, SFP (Vaz de Melo et al.) fit]

SFP: single mode, no periodic spikes.

Experiments: Can RSC Match Real Data?

[Figure: panels labeled Reddit users and Twitter users, RSC (proposed model) fit]

RSC: two modes, periodic spikes.

Experiments: Can RSC Match Real Data?

[Figure: consecutive IAT, Twitter data vs. CNPP (Malmgren et al.) fit]

CNPP: no IAT correlation.

Experiments: Can RSC Match Real Data?

[Figure: consecutive IAT, Twitter data vs. SFP (Vaz de Melo et al.) fit]

SFP: IAT correlation (but too strong!).

Experiments: Can RSC Match Real Data?

[Figure: consecutive IAT, Twitter data vs. RSC (proposed model) fit]

RSC: IAT correlation.

Outline

Pattern Mining

Modeling

Bot Detection

Experiments: Can RSC match real data? How well can RSC-Spotter detect bots?

Conclusion

Experiments: Can RSC-Spotter Detect Bots?

Methodology

Datasets: users were manually labeled as bots or humans.

Reddit: 1,963 humans, 37 bots

Twitter: 1,353 humans, 64 bots

Training: same size for train and test subsets (preserved class distribution).

Baseline features:

1. IAT Histogram: log-binned IAT histogram
2. Entropy: entropy of the IAT histogram
3. Week Hist.: number of postings for each day of the week
4. All features: combination of 1, 2 and 3
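Two of the baseline features can be sketched in a few lines. The function names are illustrative, and so are the specific choices: entropy is measured in bits, time-stamps are Unix seconds, and days of the week are taken in UTC. The slides do not state these details.

```python
import math
from datetime import datetime, timezone
from typing import List, Sequence


def histogram_entropy(counts: Sequence[float]) -> float:
    """Shannon entropy (in bits) of a histogram, ignoring empty bins."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def weekday_histogram(timestamps: Sequence[float]) -> List[int]:
    """Number of postings per day of week (index 0 = Monday), using UTC."""
    counts = [0] * 7
    for ts in timestamps:
        counts[datetime.fromtimestamp(ts, tz=timezone.utc).weekday()] += 1
    return counts


print(histogram_entropy([1, 1, 1, 1]))   # 2.0
print(weekday_histogram([0, 86400]))     # [0, 0, 0, 1, 1, 0, 0]
```

A histogram concentrated in a single bin has entropy 0, while one spread evenly over many bins has high entropy.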

Experiments: Can RSC-Spotter Detect Bots?

Precision vs. Sensitivity Curves

Good perform