RSC: Mining and Modeling Temporal Activity in Social Media

Post on 17-Mar-2018

  • RSC: Mining and Modeling Temporal Activity in Social Media

    Alceu F. Costa*, Yuto Yamaguchi, Agma J. M. Traina, Caetano Traina Jr., Christos Faloutsos

    Universidade de São Paulo

    KDD 2015, Sydney, Australia

    *alceufc@icmc.usp.br

  • Introduction

    Users generate sequences of time-stamps when they use a social media Web site

    What can we learn from time-stamps?

    Are there common patterns?

    Can we tell if a user is a bot or a human?

    [Figure: sequence of tweet time-stamps; each bar is a tweet time-stamp]

  • Outline

    Pattern Mining: What patterns can we discover from temporal activities of social media users?

    Modeling

    Bot Detection

    Experiments

    Conclusion

  • Pattern Mining: Datasets

    Reddit Dataset: time-stamps from comments

    21,198 users

    20 million time-stamps

    Twitter Dataset: time-stamps from tweets

    6,790 users

    16 million time-stamps

    For each user we have:

    Sequence of posting time-stamps: $T = (t_1, t_2, t_3, \ldots)$

    Inter-arrival times (IAT) of postings: $(\Delta_1, \Delta_2, \Delta_3, \ldots)$

    [Figure: timeline with time-stamps $t_1, t_2, t_3, t_4$ and the gaps $\Delta_1, \Delta_2, \Delta_3$ between them]
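As a minimal sketch of how inter-arrival times follow from a user's posting time-stamps, the function below sorts the time-stamps and takes consecutive differences. The function name and the sample values are illustrative assumptions, not part of the original slides or datasets.

```python
from typing import List, Sequence


def inter_arrival_times(timestamps: Sequence[float]) -> List[float]:
    """Return the gaps between consecutive time-stamps, in the same unit."""
    ordered = sorted(timestamps)  # postings may arrive unsorted
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical posting times in seconds (not taken from the datasets).
example_iat = inter_arrival_times([0.0, 60.0, 150.0, 10950.0])
```

A sequence of n time-stamps yields n - 1 inter-arrival times, so a user with a single posting has none.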

  • Pattern Mining

    Pattern 1: Distribution of IAT is heavy-tailed

    Users can be inactive for long periods of time before making new postings

    [Figure: IAT complementary cumulative distribution function (CCDF) on log-log axes; panels: Reddit users, Twitter users]
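One way to compute the empirical CCDF used to inspect the tail is sketched below. This is an illustrative helper under the usual definition P(IAT > x), not code from the authors.

```python
from typing import List, Sequence, Tuple


def empirical_ccdf(iat: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (x, P(IAT > x)) pairs for each distinct observed value x."""
    ordered = sorted(iat)
    n = len(ordered)
    points = []
    for index, value in enumerate(ordered):
        # Emit one point per distinct value, at its last occurrence.
        if index + 1 == n or ordered[index + 1] != value:
            points.append((value, (n - index - 1) / n))
    return points
```

The last point always has probability 0, so it has to be dropped before plotting on log-log axes.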

  • Pattern Mining

    Pattern 2: Bimodal IAT distribution

    Users have highly active sections and resting periods

    [Figure: log-binned histogram of postings IAT, Twitter users; x-axis: $\Delta$, IAT (seconds), y-axis: PDF; 1st mode (1 min), 2nd mode (3 h)]
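A log-binned histogram can be sketched as follows. The bin layout (a fixed number of bins per decade between 10^0 and 10^7 seconds) is an assumption chosen for illustration; the slides do not state the binning that was used. The function returns the fraction of values per bin and does not divide by bin width, so it is not a density.

```python
import math
from typing import List, Sequence


def log_binned_histogram(iat: Sequence[float], bins_per_decade: int = 5,
                         min_exp: int = 0, max_exp: int = 7) -> List[float]:
    """Fraction of IAT values falling in each log10-spaced bin."""
    n_bins = (max_exp - min_exp) * bins_per_decade
    counts = [0] * n_bins
    for value in iat:
        if value <= 0:
            continue  # log10 is undefined for non-positive gaps
        position = (math.log10(value) - min_exp) * bins_per_decade
        # Clamp values outside the covered range into the edge bins.
        index = min(max(int(math.floor(position)), 0), n_bins - 1)
        counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins
```

With this layout, the two modes at about one minute and three hours fall in clearly separated bins, which is what makes the bimodal shape visible.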

  • Pattern Mining

    Pattern 3: Periodic spikes in the IAT distribution

    Caused by daily sleeping intervals

    [Figure: log-binned histogram of postings IAT, Reddit users; x-axis: $\Delta$, IAT (seconds), y-axis: PDF; marks at 7h, 12h, 24h, 48h, 72h]

  • Pattern Mining

    Pattern 4: Consecutive IAT are correlated

    Long/short IAT are likely to be followed by long/short IAT

    [Figure: heat-map of pairs of consecutive IAT, all Reddit users; the concentration of pairs along the diagonal indicates positive correlation]
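The slide shows the correlation visually as a heat-map. As a numeric companion, the sketch below computes the Pearson correlation between the logarithms of consecutive IAT. Summarizing the heat-map with a single coefficient is a swapped-in simplification, not the authors' procedure.

```python
import math
from typing import Sequence


def consecutive_log_iat_correlation(iat: Sequence[float]) -> float:
    """Pearson correlation between log(IAT_{i-1}) and log(IAT_i)."""
    if any(v <= 0 for v in iat):
        raise ValueError("all IAT values must be positive")
    logs = [math.log(v) for v in iat]
    xs, ys = logs[:-1], logs[1:]  # pairs of consecutive IAT
    if len(xs) < 2:
        raise ValueError("need at least three IAT values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant IAT")
    return cov / math.sqrt(var_x * var_y)
```

A value near +1 corresponds to pairs concentrated along the diagonal of the heat-map.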

  • Outline

    Pattern Mining

    Modeling: Can we model the patterns?

    Bot Detection

    Experiments

    Conclusion

  • RSC Model

    Can we generate synthetic time-stamps that match real data patterns?

    Proposed Model: Rest-Sleep-and-Comment

    [Table: which patterns each model reproduces. Columns: Poisson Process; Queue Based (Barabási, 2005); CNPP (Malmgren et al., 2009); SFP (Vaz de Melo et al., 2013); RSC (Proposed Model). Rows: Heavy Tails; Bimodal Distribution; Periodic Spikes; IAT Correlation. The cell marks are not preserved in this transcript.]

  • RSC Model

    Base model: Self-Correlated Process (SCorr)

    Definition: A stochastic process is a SCorr process with base rate $\lambda$ and correlation $\rho$ if:

    $\Delta_i \sim \mathrm{Exp}(\rho \Delta_{i-1} + 1/\lambda)$

    where $X \sim \mathrm{Exp}(1/\beta)$ denotes an exponential random variable with rate $\beta$

    Consecutive IAT are correlated: the $i$-th IAT $\Delta_i$ depends on the previous $(i-1)$-th IAT $\Delta_{i-1}$

    $\rho$ controls correlation strength: if $\rho = 0$, SCorr reduces to an exponential distribution
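A minimal sampling sketch for a self-correlated process follows. It reads the definition as: each IAT is exponentially distributed with mean equal to the correlation parameter times the previous IAT, plus the reciprocal of the base rate. That reading, the parameter names `base_rate` and `rho`, and the choice to draw the first IAT from the base mean alone are assumptions made for illustration.

```python
import random
from typing import List, Optional


def sample_scorr(n: int, base_rate: float, rho: float,
                 seed: Optional[int] = None) -> List[float]:
    """Draw n IAT values where each mean is rho * previous + 1 / base_rate."""
    rng = random.Random(seed)
    iat: List[float] = []
    previous = 0.0  # assumption: the first draw uses the base mean only
    for _ in range(n):
        mean = rho * previous + 1.0 / base_rate
        previous = rng.expovariate(1.0 / mean)  # expovariate takes a rate
        iat.append(previous)
    return iat
```

With `rho = 0` every draw has the same mean, so the sampler reduces to independent exponential waits. A positive `rho` makes a long wait raise the mean of the next one, which is the source of the correlation between consecutive IAT.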

  • RSC Model

    SCorr Process

    Pattern checklist: Correlated IAT, Heavy Tail, Bimodal Distribution, Periodic Spikes

    [Figure: consecutive IAT distribution, SCorr (synthetic data), $\lambda$ = 20h, $\rho$ = 0.7]

  • RSC Model

    SCorr Process

    Pattern checklist: Correlated IAT, Heavy Tail, Bimodal Distribution, Periodic Spikes

    [Figure: IAT CCDF, Reddit data vs. SCorr, $\lambda$ = 20h, $\rho$ = 0.7]

  • RSC Model

    SCorr Process

    Pattern checklist: Correlated IAT, Heavy Tail, Bimodal Distribution, Periodic Spikes

    [Figure: IAT log-binned histogram, data vs. SCorr, $\lambda$ = 20m, $\rho$ = 1.0]

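  • RSC Model

    Model States

    Active:

    1. Wait $\Delta \sim \mathrm{SCorr}(\lambda_A, \rho_A)$

    2. Post with probability $p_{post}$

    3. Transition

    Rest:

    1. Wait $\Delta \sim \mathrm{SCorr}(\lambda_R, \rho_R)$

    2. Transition

    Base rates: $\lambda_A > \lambda_R$. The average wait time in the active state is smaller than in the rest state.

    State Transitions

    [Figure: state diagram with Active and Rest states; edge labels: $1 - p_R$, $p_R$, $1 - p_A$, $p_A$]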
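The active and rest states can be sketched as a small simulation. The slides do not spell out which edge of the state diagram carries which probability, so the `p_stay` entries below are a hypothetical parameterization, as are the dictionary layout, the per-state memory of the previous wait, and all numeric values.

```python
import random
from typing import Dict, List, Optional


def simulate_active_rest(n_posts: int, params: Dict[str, Dict[str, float]],
                         p_post: float, seed: Optional[int] = None) -> List[float]:
    """Return n_posts time-stamps from a two-state (active/rest) walk."""
    if p_post <= 0:
        raise ValueError("p_post must be positive, or no posting is ever made")
    rng = random.Random(seed)
    state, clock = "active", 0.0
    previous = {"active": 0.0, "rest": 0.0}  # last wait drawn in each state
    stamps: List[float] = []
    while len(stamps) < n_posts:
        cfg = params[state]
        # SCorr-style wait: exponential with mean rho * previous + 1 / rate.
        mean = cfg["rho"] * previous[state] + 1.0 / cfg["rate"]
        wait = rng.expovariate(1.0 / mean)
        previous[state] = wait
        clock += wait
        if state == "active" and rng.random() < p_post:
            stamps.append(clock)
        # Assumed transition rule: stay with probability cfg["p_stay"].
        if rng.random() >= cfg["p_stay"]:
            state = "rest" if state == "active" else "active"
    return stamps


# Hypothetical parameters (rates per second); the active rate exceeds the rest rate.
example_params = {
    "active": {"rate": 1 / 60.0, "rho": 0.5, "p_stay": 0.8},
    "rest": {"rate": 1 / 10800.0, "rho": 0.5, "p_stay": 0.3},
}
```

Short waits inside the active state and long waits across rest visits give two typical gap lengths, which is how the two states relate to the bimodal IAT distribution.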

  • RSC Model

    SCorr + Rest and Active States

    Pattern checklist: Heavy Tail, Correlated IAT, Bimodal Distribution, Periodic Spikes

    [Figure: IAT log-binned histogram, data vs. synthetic]

  • RSC Model

    Modeling periodic spikes: sleep state

    Keep track of the current time:

    $t_{clock}$ variable, 0:00h < $t_{clock}$ < 23:59h

    Update $t_{clock}$ after each wait time

    Enter the sleep state if:

    Current state = rest and ($t_{clock} < t_{wake}$ or $t_{clock} > t_{sleep}$)

    In the sleep state:

    1. Wait until the next wake-up time, $t_{wake}$

    2. Transition to rest state

    [Figure: daily cycle showing $t_{sleep}$, $t_{wake}$ and $t_{clock}$, with Sleep and Awake intervals]
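The clock bookkeeping for the sleep state can be sketched with three small helpers. Representing the clock as fractional hours in [0, 24), and assuming the wake-up time precedes the sleep time within the same day, are illustrative choices that the slides do not specify.

```python
def is_sleep_time(t_clock: float, t_wake: float, t_sleep: float) -> bool:
    """True when the clock (hours in [0, 24)) falls outside the awake window."""
    return t_clock < t_wake or t_clock > t_sleep


def hours_until_wake(t_clock: float, t_wake: float) -> float:
    """Hours from t_clock to the next occurrence of t_wake."""
    return (t_wake - t_clock) % 24.0


def advance_clock(t_clock: float, wait_hours: float) -> float:
    """Update the clock after a wait, wrapping around midnight."""
    return (t_clock + wait_hours) % 24.0
```

For example, with a wake-up time of 7:00 and a sleep time of 23:00, a clock reading of 23:30 counts as sleep time and the wait until wake-up is 7.5 hours.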

  • RSC Model

    Complete RSC Model

    Pattern checklist: Heavy Tail, Correlated IAT, Bimodal Distribution, Periodic Spikes

    Parameter estimation uses the Levenberg-Marquardt algorithm

    [Figure: IAT log-binned histogram]

  • Outline

    Pattern Mining

    Modeling

    Bot Detection: Can we spot automated behavior based only on time-stamp data?

    Experiments

    Conclusion

  • Bot Detection

    Problem: Given labeled time-stamp data from a set of users $\{U_1, U_2, U_3, \ldots\}$, decide if an unknown user $U_i$ is a human or a bot.

    Solution: RSC-Spotter

    Compare the user's IAT to synthetic IAT generated by the RSC model

    If not similar to RSC, then the user is likely to be a bot

    [Figure: sequence of time-stamps from a single user; x-axis: Time (days), 0 to 70. Is the user that produced the time-stamps a human or a bot?]

  • RSC-Spotter

    Comparing Time-stamps

    Estimate the RSC parameters: time-stamps from all users

    For each user:

    1. Compute the IAT histogram, using log-spaced bins

    2. Generate synthetic time-stamps using RSC (RSC can generate the same number of time-stamps as the user)

    3. Compare user and synthetic IAT histogram:

    $D = \sum_i |c_i - \hat{c}_i|$ (dissimilarity), where $c_i$ are the IAT bin counts of the user data and $\hat{c}_i$ are the IAT bin counts of the synthetic data

    Cost-sensitive classification is used to decide if a user is a bot given the dissimilarity $D$
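The comparison step can be sketched as a function over two histograms that share the same bins. The dissimilarity is read here as the sum of absolute per-bin differences between user and synthetic counts. The function and argument names are illustrative, and the cost-sensitive classifier that consumes the score is not shown.

```python
from typing import Sequence


def histogram_dissimilarity(user_counts: Sequence[float],
                            synthetic_counts: Sequence[float]) -> float:
    """Sum of absolute differences between matching histogram bins."""
    if len(user_counts) != len(synthetic_counts):
        raise ValueError("histograms must use the same bins")
    return sum(abs(c - s) for c, s in zip(user_counts, synthetic_counts))
```

Because the synthetic sequence has the same number of time-stamps as the user, the two histograms have the same total count, and a larger score means the user's IAT looks less like the model's output.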

  • Outline

    Pattern Mining

    Modeling

    Bot Detection

    Experiments: Can RSC match real data? How well can RSC-Spotter detect bots?

    Conclusion

  • Experiments: Can RSC Match Real Data?

    Models compared: RSC (proposed model), CNPP (Malmgren et al.), SFP (Vaz de Melo et al.)

    [Table: patterns (Heavy Tail, Bimodal, Spikes, IAT Correlation) against models (CNPP, SFP, RSC)]

    [Figure: panels for Reddit users and Twitter users]

    CNPP fails to match the heavy tail

  • Experiments: Can RSC Match Real Data?

    [Figure: Reddit users, CNPP (Malmgren et al.)]

    Two modes, no periodic spikes

  • Experiments: Can RSC Match Real Data?

    [Figure: Reddit users, SFP (Vaz de Melo et al.)]

    Single mode, no periodic spikes

  • Experiments: Can RSC Match Real Data?

    [Figure: Reddit users, RSC (proposed model)]

    Two modes, periodic spikes

  • Experiments: Can RSC Match Real Data?

    [Figure: Twitter data vs. CNPP fit (Malmgren et al.)]

    No IAT correlation

  • Experiments: Can RSC Match Real Data?

    [Figure: Twitter data vs. SFP fit (Vaz de Melo et al.)]

    IAT correlation (but too strong!)

  • Experiments: Can RSC Match Real Data?

    [Figure: Twitter data vs. RSC fit (proposed model)]

    IAT correlation

  • Outline

    Pattern Mining

    Modeling

    Bot Detection

    Experiments: Can RSC match real data? How well can RSC-Spotter detect bots?

    Conclusion

  • Experiments: Can RSC-Spotter Detect Bots?

    Methodology

    Datasets: users were manually labeled as bots or humans

    Reddit: 1,963 humans, 37 bots

    Twitter: 1,353 humans, 64 bots

    Training: same size for train and test subsets (preserved class distribution)

    Baseline features:

    1. IAT Histogram: log-binned IAT histogram

    2. Entropy: entropy of the IAT histogram

    3. Week Hist.: number of postings for each day of the week

    4. All features: combination of 1, 2 and 3
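Two of the baseline features can be sketched directly. The entropy below is Shannon entropy in bits, and the weekday histogram assumes Unix time-stamps interpreted in UTC. Both choices, and the function names, are assumptions, since the slides do not state the entropy base or the time zone.

```python
import math
from datetime import datetime, timezone
from typing import List, Sequence


def histogram_entropy(counts: Sequence[float]) -> float:
    """Shannon entropy (in bits) of a histogram, ignoring empty bins."""
    total = sum(counts)
    if total <= 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def weekday_histogram(timestamps: Sequence[float]) -> List[int]:
    """Postings per day of week (Monday = index 0) for Unix time-stamps in UTC."""
    counts = [0] * 7
    for ts in timestamps:
        counts[datetime.fromtimestamp(ts, tz=timezone.utc).weekday()] += 1
    return counts
```

A user whose postings all fall in a single IAT bin has entropy 0, which is the kind of regularity an entropy feature is meant to expose.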

  • Experiments: Can RSC-Spotter Detect Bots?

    Precision vs. Sensitivity Curves

    Good perform