modeling social data, lecture 1: overview

80
Introduction and Overview APAM E4990 Modeling Social Data Jake Hofman Columbia University January 20, 2017 Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 1 / 53

Upload: jakehofman

Post on 24-Jan-2017

224 views

Category:

Education


2 download

TRANSCRIPT

Page 1: Modeling Social Data, Lecture 1: Overview

Introduction and OverviewAPAM E4990

Modeling Social Data

Jake Hofman

Columbia University

January 20, 2017

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 1 / 53

Page 2: Modeling Social Data, Lecture 1: Overview

Course overview

Modeling social data requires an understanding of:

1 How to obtain data produced by (online) human interactions,

2 What questions we typically ask about human-generated data,

3 How to reframe these questions as mathematical models, and

4 How to interpret the results of these models in ways thataddress our questions.

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 2 / 53

Page 3: Modeling Social Data, Lecture 1: Overview

Questions

Many long-standing questions in the social sciences are notoriouslydifficult to answer, e.g.:

• “Who says what to whom in what channel with what effect”?(Laswell, 1948)

• How do ideas and technology spread through cultures?(Rogers, 1962)

• How do new forms of communication affect society?(Singer, 1970)

• . . .

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 3 / 53

Page 4: Modeling Social Data, Lecture 1: Overview

Questions

Typically difficult to observe the relevant information viaconventional methods

Moreno, 1933

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 4 / 53

Page 5: Modeling Social Data, Lecture 1: Overview

Large-scale data

Recently available electronic data provide an unprecedentedopportunity to address these questions at scale

Demographic Behavioral Network

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 5 / 53

Page 6: Modeling Social Data, Lecture 1: Overview

Computational social science

An emerging discipline at the intersection of the social sciences,statistics, and computer science

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 6 / 53

Page 7: Modeling Social Data, Lecture 1: Overview

Computational social science

An emerging discipline at the intersection of the social sciences,statistics, and computer science

(motivating questions)

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 6 / 53

Page 8: Modeling Social Data, Lecture 1: Overview

Computational social science

An emerging discipline at the intersection of the social sciences,statistics, and computer science

(fitting large, potentially sparse models)

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 6 / 53

Page 9: Modeling Social Data, Lecture 1: Overview

Computational social science

An emerging discipline at the intersection of the social sciences,statistics, and computer science

(parallel processing for filtering and aggregating data)

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 6 / 53

Page 10: Modeling Social Data, Lecture 1: Overview

Topics

Exploratory Data Analysis

Classification

Regression

Networks

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 7 / 53

Page 11: Modeling Social Data, Lecture 1: Overview

Exploratory Data Analysis

(a.k.a. counting and plotting things)

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 8 / 53

Page 12: Modeling Social Data, Lecture 1: Overview

Regression

(a.k.a. modeling continuous things)

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 9 / 53

Page 13: Modeling Social Data, Lecture 1: Overview

Classification

(a.k.a. modeling discrete things)

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 10 / 53

Page 14: Modeling Social Data, Lecture 1: Overview

Networks

(a.k.a. counting complicated things)

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 11 / 53

Page 15: Modeling Social Data, Lecture 1: Overview

Topics

http://modelingsocialdata.org

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 12 / 53

Page 16: Modeling Social Data, Lecture 1: Overview

The clean real story

“We have a habit in writing articles published inscientific journals to make the work as finished aspossible, to cover all the tracks, to not worry about theblind alleys or to describe how you had the wrong ideafirst, and so on. So there isn’t any place to publish, ina dignified manner, what you actually did in order toget to do the work ...”

-Richard FeynmanNobel Lecture1, 1965

1http://bit.ly/feynmannobelJake Hofman (Columbia University) Introduction and Overview January 20, 2017 13 / 53

Page 17: Modeling Social Data, Lecture 1: Overview

Outline

Web demographicsD

aily

Per

−C

apita

Pag

evie

ws

0

10

20

30

40

50

60

70

●●

Over $25k

Under $25k

Black

&

Hispanic

White

No College

Some College

Over 65

Under 65

Female

Male

Income Race Education Age Sex

Search predictions"Right Round"

Week

Ran

k

40

30

20

10

cccccccccccccccccccccccccccccccccccccccccc

Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09

BillboardSearch

Viral hits

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 14 / 53

Page 18: Modeling Social Data, Lecture 1: Overview

Predicting consumer activity with Web searchwith Sharad Goel, Sebastien Lahaie, David Pennock, Duncan Watts

"Right Round"

Week

Ran

k

40

30

20

10

cccccccccccccccccccccccccccccccccccccccccc

Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09

BillboardSearch

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 15 / 53

Page 19: Modeling Social Data, Lecture 1: Overview

Search predictionsMotivation

Does collective search activityprovide useful predictive signalabout real-world outcomes?

"Right Round"

Week

Ran

k

40

30

20

10

cccccccccccccccccccccccccccccccccccccccccc

Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09

BillboardSearch

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 16 / 53

Page 20: Modeling Social Data, Lecture 1: Overview

Search predictionsMotivation

Past work mainly focuses on predicting the present2 and ignoresbaseline models trained on publicly available data

Date

Flu

Lev

el (

Per

cent

)

1

2

3

4

5

6

7

8

2004 2005 2006 2007 2008 2009 2010

Actual

Search

Autoregressive

2Varian, 2009Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 17 / 53

Page 21: Modeling Social Data, Lecture 1: Overview

Search predictionsMotivation

We predict future sales for movies, video games, and music

"Transformers 2"

Time to Release (Days)

Sea

rch

Volu

me

a

−30 −20 −10 0 10 20 30

"Tom Clancy's HAWX"

Time to Release (Days)

Sea

rch

Volu

me

b

−30 −20 −10 0 10 20 30

"Right Round"

Week

Ran

k

40

30

20

10

cccccccccccccccccccccccccccccccccccccccccc

Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09

Billboard

Search

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 18 / 53

Page 22: Modeling Social Data, Lecture 1: Overview

Search predictionsSearch models

For movies and video games, predict opening weekend box officeand first month sales, respectively:

log(revenue) = β0 + β1 log(search) + ε

For music, predict following week’s Billboard Hot 100 rank:

billboardt+1 = β0 + β1searcht + β2searcht−1 + ε

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 19 / 53

Page 23: Modeling Social Data, Lecture 1: Overview

Search predictionsSearch volume

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 20 / 53

Page 24: Modeling Social Data, Lecture 1: Overview

Search predictionsSearch models

Search activity is predictive for movies, video games, and musicweeks to months in advance

Movies

Predicted Revenue (Dollars)

Actu

al Re

venu

e (D

ollar

s)

103

104

105

106

107

108

109

●●

●●

● ●

●●

●●

●●

●●●

● ●

●●●

●●●

●●

●●

●●

●●

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

103 104 105 106 107 108 109

Video Games

Predicted Revenue (Dollars)

Actu

al Re

venu

e (D

ollar

s)103

104

105

106

107

●●

●●

●●

●●

●●

●●

● ●

●●

bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

103 104 105 106 107

● Non−SequelSequel

Music

Predicted Billboard Rank

Actu

al Bi

llboa

rd R

ank

0

20

40

60

80

100

●●

●●

●●

●●

●●

●●

● ●

●●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

c

0 20 40 60 80 100

Movies

Time to Release (Weeks)

Mod

el Fi

t

0.4

0.5

0.6

0.7

0.8

0.9 ddddddd

−6 −5 −4 −3 −2 −1 0

Video Games

Time to Release (Weeks)

Mod

el Fi

t

0.4

0.5

0.6

0.7

0.8

0.9 eeeeeee

−6 −5 −4 −3 −2 −1 0

Music

Time to Release (Weeks)M

odel

Fit

0.4

0.5

0.6

0.7

0.8

0.9 fffffff

−6 −5 −4 −3 −2 −1 0

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 21 / 53

Page 25: Modeling Social Data, Lecture 1: Overview

Search predictionsBaseline models

For movies, use budget, number of opening screens and HollywoodStock Exchange:

log(revenue) = β0 + β1 log(budget) + β2 log(screens) +

β3 log(hsx) + ε

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 22 / 53

Page 26: Modeling Social Data, Lecture 1: Overview

Search predictionsBaseline models

For video games, use critic ratings and predecessor sales (sequelsonly):

log(revenue) = β0 + β1rating + β2 log(predecessor) + ε

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 22 / 53

Page 27: Modeling Social Data, Lecture 1: Overview

Search predictionsBaseline models

For music, use an autoregressive model with the previouslyavailable rank:

billboardt+1 = β0 + β1billboardt−1 + ε

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 22 / 53

Page 28: Modeling Social Data, Lecture 1: Overview

Search predictionsBaseline + combined models

Baseline models are often surprisingly good

Movies (Baseline)

Predicted Revenue (Dollars)

Actu

al Re

venu

e (D

ollar

s)

103

104

105

106

107

108

109

●●

●●

● ●

●●

●●●

● ●

●●

● ●

●●●●

●●●

●●

●●

●●

●●

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

103 104 105 106 107 108 109

Video Games (Baseline)

Predicted Revenue (Dollars)

Actu

al Re

venu

e (D

ollar

s)103

104

105

106

107

●●

●●

●●

●●

●●

●●

●●

●●

bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

103 104 105 106 107

● Non−SequelSequel

Music (Baseline)

Predicted Billboard Rank

Actu

al Bi

llboa

rd R

ank

0

20

40

60

80

100

●●

●●

●●

●●

●●

●●

●●

●●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

c

0 20 40 60 80 100

Movies (Combined)

Predicted Revenue (Dollars)

Actu

al Re

venu

e (D

ollar

s)

103

104

105

106

107

108

109

●●

●●

● ●

●●

●●

●●

●●

● ●

●●●●

●●●

●●

●●

●●

●●

ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd

103 104 105 106 107 108 109

Video Games (Combined)

Predicted Revenue (Dollars)

Actu

al Re

venu

e (D

ollar

s)

103

104

105

106

107

●●

●●

●●

●●

●●

●●

●●

●●

eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

103 104 105 106 107

● Non−SequelSequel

Music (Combined)

Predicted Billboard Rank

Actu

al Bi

llboa

rd R

ank

0

20

40

60

80

100

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

f

0 20 40 60 80 100

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 23 / 53

Page 29: Modeling Social Data, Lecture 1: Overview

Search predictionsModel comparison

For movies, search is outperformed by the baseline and of littlemarginal value

M

odel

Fit

0.4

0.5

0.6

0.7

0.8

0.9

1.0

CombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombined

SearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearch

BaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaseline

Nonse

quel

Games

Seque

l Gam

es

Mus

ic

Mov

ies Flu

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 24 / 53

Page 30: Modeling Social Data, Lecture 1: Overview

Search predictionsModel comparison

For video games, search helps substantially for non-sequels, less sofor sequels

M

odel

Fit

0.4

0.5

0.6

0.7

0.8

0.9

1.0

CombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombined

SearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearch

BaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaseline

Nonse

quel

Games

Seque

l Gam

es

Mus

ic

Mov

ies Flu

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 24 / 53

Page 31: Modeling Social Data, Lecture 1: Overview

Search predictionsModel comparison

For music, the addition of search yields a substantially bettercombined model

M

odel

Fit

0.4

0.5

0.6

0.7

0.8

0.9

1.0

CombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombined

SearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearch

BaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaseline

Nonse

quel

Games

Seque

l Gam

es

Mus

ic

Mov

ies Flu

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 24 / 53

Page 32: Modeling Social Data, Lecture 1: Overview

Search predictionsSummary

• Relative performance and value of search varies acrossdomains

• Search provides a fast, convenient, and flexible signal acrossdomains

• “Predicting consumer activity with Web search”Goel, Hofman, Lahaie, Pennock & Watts, PNAS 2010

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 25 / 53

Page 33: Modeling Social Data, Lecture 1: Overview

Demographic diversity on the Webwith Irmak Sirer and Sharad Goel (ICWSM 2012)

Dai

ly P

er−

Cap

ita P

agev

iew

s

0

10

20

30

40

50

60

70

●●

Over $25k

Under $25k

Black

&

Hispanic

White

No College

Some College

Over 65

Under 65

Female

Male

Income Race Education Age Sex

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 26 / 53

Page 34: Modeling Social Data, Lecture 1: Overview

Motivation

Previous work is largely survey-based and focuses and group-leveldifferences in online access

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 27 / 53

Page 35: Modeling Social Data, Lecture 1: Overview

Motivation

“As of January 1997, we estimate that 5.2 millionAfrican Americans and 40.8 million whites have ever usedthe Web, and that 1.4 million African Americans and20.3 million whites used the Web in the past week.”

-Hoffman & Novak (1998)

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 27 / 53

Page 36: Modeling Social Data, Lecture 1: Overview

Motivation

Focus on activity instead of access

How diverse is the Web?

To what extent do online experiences vary across demographicgroups?

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 28 / 53

Page 37: Modeling Social Data, Lecture 1: Overview

Data

• Representative sample of 265,000 individuals in the US, paidvia the Nielsen MegaPanel3

• Log of anonymized, complete browsing activity from June2009 through May 2010 (URLs viewed, timestamps, etc.)

• Detailed individual and household demographic information(age, education, income, race, sex, etc.)

3Special thanks to Mainak MazumdarJake Hofman (Columbia University) Introduction and Overview January 20, 2017 29 / 53

Page 38: Modeling Social Data, Lecture 1: Overview

Data

# ls -alh nielsen_megapanel.tar

-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar

• Normalize pageviews to at most three domain levels, sans www

e.g. www.yahoo.com → yahoo.com,us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com

• Restrict to top 100k (out of 9M+ total) most popular sites(by unique visitors)

• Aggregate activity at the site, group, and user levels

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 30 / 53

Page 39: Modeling Social Data, Lecture 1: Overview

Data

# ls -alh nielsen_megapanel.tar

-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar

• Normalize pageviews to at most three domain levels, sans www

e.g. www.yahoo.com → yahoo.com,us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com

• Restrict to top 100k (out of 9M+ total) most popular sites(by unique visitors)

• Aggregate activity at the site, group, and user levels

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 30 / 53

Page 40: Modeling Social Data, Lecture 1: Overview

Data

# ls -alh nielsen_megapanel.tar

-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar

• Normalize pageviews to at most three domain levels, sans www

e.g. www.yahoo.com → yahoo.com,us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com

• Restrict to top 100k (out of 9M+ total) most popular sites(by unique visitors)

• Aggregate activity at the site, group, and user levels

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 30 / 53

Page 41: Modeling Social Data, Lecture 1: Overview

Data

# ls -alh nielsen_megapanel.tar

-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar

• Normalize pageviews to at most three domain levels, sans www

e.g. www.yahoo.com → yahoo.com,us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com

• Restrict to top 100k (out of 9M+ total) most popular sites(by unique visitors)

• Aggregate activity at the site, group, and user levels

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 30 / 53

Page 42: Modeling Social Data, Lecture 1: Overview

Aggregate usage patterns

How do users distribute their time across different categories?

Fra

ctio

n of

tota

l pag

evie

ws

0.05

0.10

0.15

0.20

0.25●

● ●

Social

Med

ia

E−mail

Games

Porta

ls

Searc

h

All groups spend the majority of their time in the top five mostpopular categories

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 31 / 53

Page 43: Modeling Social Data, Lecture 1: Overview

Aggregate usage patterns

How do users distribute their time across different categories?

User Rank by Daily Activity

Fra

ctio

n of

Pag

evie

ws

in C

ateg

ory

0.05

0.10

0.15

0.20

0.25

0.30

● ● ● ●●

10% 30% 50% 70% 90%

● Social Media

E−mail

Games

Portals

Search

Highly active users devote nearly twice as much of their time tosocial media relative to typical individuals

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 31 / 53

Page 44: Modeling Social Data, Lecture 1: Overview

Group-level activity

How does browsing activity vary at the group level?

Dai

ly P

er−

Cap

ita P

agev

iew

s

0

10

20

30

40

50

60

70

●●

Over $25k

Under $25k

Black

&

Hispanic

White

No College

Some College

Over 65

Under 65

Female

Male

Income Race Education Age Sex

Large differences exist even at the aggregate level(e.g. women on average generate 40% more pageviews than men)

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 32 / 53

Page 45: Modeling Social Data, Lecture 1: Overview

Group-level activity

How does browsing activity vary at the group level?

Dai

ly P

er−

Cap

ita P

agev

iew

s

0

10

20

30

40

50

60

70

●●

Over $25k

Under $25k

Black

&

Hispanic

White

No College

Some College

Over 65

Under 65

Female

Male

Income Race Education Age Sex

Younger and more educated individuals are both more likely toaccess the Web and more active once they do

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 32 / 53

Page 46: Modeling Social Data, Lecture 1: Overview

Group-level activity

All demographic groups spend the majority of their time in thesame categories

Age

Fra

ctio

n of

tota

l pag

evie

ws

0.0

0.1

0.2

0.3

0.4

0.5

●●

● ●

●●

● ●

5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80

● Social Media

E−mail

Games

Portals

Search

Fr

actio

n of

tota

l pag

evie

ws

0.0

0.1

0.2

0.3

0.4Education

● ●

●●

Grammar

Schoo

l

Some H

igh Sch

ool

High Sch

ool G

radua

te

Some C

ollege

Associa

te Deg

ree

Bache

lor's D

egree

Post G

radua

te Deg

ree

Sex

Female Male

Income

●● ●

●●

$0−25k

$25−50k

$50−75k

$75−100k

$100−150k

$150k+

Race

● ●● ●

Other

Hispan

icBlack

White

Asian

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 33 / 53

Page 47: Modeling Social Data, Lecture 1: Overview

Group-level activity

Older, more educated, male, wealthier, and Asian Internet usersspend a smaller fraction of their time on social media

Age

Fra

ctio

n of

tota

l pag

evie

ws

0.0

0.1

0.2

0.3

0.4

0.5

●●

● ●

●●

● ●

5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80

● Social Media

E−mail

Games

Portals

Search

Fr

actio

n of

tota

l pag

evie

ws

0.0

0.1

0.2

0.3

0.4Education

● ●

●●

Grammar

Schoo

l

Some H

igh Sch

ool

High Sch

ool G

radua

te

Some C

ollege

Associa

te Deg

ree

Bache

lor's D

egree

Post G

radua

te Deg

ree

Sex

Female Male

Income

●● ●

●●

$0−25k

$25−50k

$50−75k

$75−100k

$100−150k

$150k+

Race

● ●● ●

Other

Hispan

icBlack

White

Asian

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 33 / 53

Page 48: Modeling Social Data, Lecture 1: Overview

Group-level activity

Lower social media use by these groups is often accompanied byhigher e-mail volume

Age

Fra

ctio

n of

tota

l pag

evie

ws

0.0

0.1

0.2

0.3

0.4

0.5

●●

● ●

●●

● ●

5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80

● Social Media

E−mail

Games

Portals

Search

Fr

actio

n of

tota

l pag

evie

ws

0.0

0.1

0.2

0.3

0.4Education

● ●

●●

Grammar

Schoo

l

Some H

igh Sch

ool

High Sch

ool G

radua

te

Some C

ollege

Associa

te Deg

ree

Bache

lor's D

egree

Post G

radua

te Deg

ree

Sex

Female Male

Income

●● ●

●●

$0−25k

$25−50k

$50−75k

$75−100k

$100−150k

$150k+

Race

● ●● ●

Other

Hispan

icBlack

White

Asian

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 33 / 53

Page 49: Modeling Social Data, Lecture 1: Overview

Revisiting the digital divide

How does usage of news, health, and reference vary withdemographics?

A

vera

ge p

agev

iew

s pe

r mon

th

0

2

4

6

8

10

12Education

● ●

Grammar

Schoo

l

Some H

igh Sch

ool

High Sch

ool G

radua

te

Some C

ollege

Associa

te Deg

ree

Bache

lor's D

egree

Post G

radua

te Deg

ree

Sex

Female Male

Income

● ● ●●

$0−25k

$25−50k

$50−75k

$75−100k

$100−150k

$150k+

Race

● ●●

Other

Hispan

icBlack

White

Asian

● NewsHealthReference

Post-graduates spend three times as much time on health sitesthan adults with only some high school education

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 34 / 53

Page 50: Modeling Social Data, Lecture 1: Overview

Revisiting the digital divide

How does usage of news, health, and reference vary withdemographics?

A

vera

ge p

agev

iew

s pe

r mon

th

0

2

4

6

8

10

12Education

● ●

Grammar

Schoo

l

Some H

igh Sch

ool

High Sch

ool G

radua

te

Some C

ollege

Associa

te Deg

ree

Bache

lor's D

egree

Post G

radua

te Deg

ree

Sex

Female Male

Income

● ● ●●

$0−25k

$25−50k

$50−75k

$75−100k

$100−150k

$150k+

Race

● ●●

Other

Hispan

icBlack

White

Asian

● NewsHealthReference

Asians spend more than 50% more time browsing online news thando other race groups

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 34 / 53

Page 51: Modeling Social Data, Lecture 1: Overview

Revisiting the digital divide

How does usage of news, health, and reference vary withdemographics?

A

vera

ge p

agev

iew

s pe

r mon

th

0

2

4

6

8

10

12Education

● ●

Grammar

Schoo

l

Some H

igh Sch

ool

High Sch

ool G

radua

te

Some C

ollege

Associa

te Deg

ree

Bache

lor's D

egree

Post G

radua

te Deg

ree

Sex

Female Male

Income

● ● ●●

$0−25k

$25−50k

$50−75k

$75−100k

$100−150k

$150k+

Race

● ●●

Other

Hispan

icBlack

White

Asian

● NewsHealthReference

Even when less educated and less wealthy groups gain access tothe Web, they utilize these resources relatively infrequently

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 34 / 53

Page 52: Modeling Social Data, Lecture 1: Overview

Revisiting the digital divide

How does usage of news, health, and reference vary withdemographics?

A

vera

ge p

agev

iew

s pe

r m

onth

0

2

4

6

8

10

12

News

● ●

High S

choo

l Gra

duat

e

Some

Colleg

e

Assoc

iate

Degre

e

Bache

lor's

Degre

e

Post G

radu

ate

Degre

e

Health

●● ●

●●

High S

choo

l Gra

duat

e

Some

Colleg

e

Assoc

iate

Degre

e

Bache

lor's

Degre

e

Post G

radu

ate

Degre

e

Reference

●● ●

● ●

High S

choo

l Gra

duat

e

Some

Colleg

e

Assoc

iate

Degre

e

Bache

lor's

Degre

e

Post G

radu

ate

Degre

e

Asian

Black

Hispanic

White

Controlling for other variables, effects of race and gender largelydisappear, while education continues to have large effect

pi =∑j

αjxij +∑j

∑k

βjkxijxik +∑j

γjx2ij + εi

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 35 / 53

Page 53: Modeling Social Data, Lecture 1: Overview

Revisiting the digital divide

How does usage of news, health, and reference vary withdemographics?

A

vera

ge p

agev

iew

s pe

r m

onth

0

2

4

6

8

10

12

Health

●● ●

● ●

High S

choo

l Gra

duat

e

Some

Colleg

e

Assoc

iate

Degre

e

Bache

lor's

Degre

e

Post G

radu

ate

Degre

e

Female

Male

However, women spend considerably more time on health sitescompared to men

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 36 / 53

Page 54: Modeling Social Data, Lecture 1: Overview

Revisiting the digital divide

How does usage of news, health, and reference vary withdemographics?

Monthly pageviews on health sites

20 40 60 80 100

Female

Male

However, women spend considerably more time on health sitescompared to men, although means can be misleading

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 36 / 53

Page 55: Modeling Social Data, Lecture 1: Overview

Summary

• Highly active users spend disproportionately more of theirtime on social media and less on e-mail relative to the overallpopulation

• Access to research, news, and healthcare is strongly related toeducation, not as closely to ethnicity

• User demographics can be inferred from browsing activity withreasonable accuracy

• “Who Does What on the Web”, Goel, Hofman & Sirer,ICWSM 2012

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 37 / 53

Page 56: Modeling Social Data, Lecture 1: Overview

The structural virality of online diffusionwith Ashton Anderson, Sharad Goel, Duncan Watts (Management Science 2015)

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 38 / 53

Page 57: Modeling Social Data, Lecture 1: Overview

“Going Viral”?

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 39 / 53

Page 58: Modeling Social Data, Lecture 1: Overview

“Going Viral”?

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 40 / 53

Page 59: Modeling Social Data, Lecture 1: Overview

“Going Viral”?

“Therefore we ... wish to proceed with great care as isproper, and to cut off the advance of this plague andcancerous disease so it will not spread any further ...”4

-Pope Leo XExsurge Domine (1520)

4http://www.economist.com/node/21541719Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 40 / 53

Page 60: Modeling Social Data, Lecture 1: Overview

“Going Viral”?

Rogers (1962), Bass (1969)

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 41 / 53

Page 61: Modeling Social Data, Lecture 1: Overview

“Going viral”?

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 42 / 53

Page 62: Modeling Social Data, Lecture 1: Overview

“Going viral”?

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 42 / 53

Page 63: Modeling Social Data, Lecture 1: Overview

“Going viral”?

How do popular things become popular?

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 43 / 53

Page 64: Modeling Social Data, Lecture 1: Overview

Data

• Examined one year of tweets from July 2011 to July 2012

• Restricted to 1.4 billion tweets containing links to top news,videos, images, and petitions sites

• Aggregated tweets by URL, resulting in 1 billion distinct“events”

• Crawled friend list of each adopter

• Inferred “who got what from whom” to construct diffusiontrees

• Characterized size and structure of trees

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 44 / 53

Page 65: Modeling Social Data, Lecture 1: Overview

Data

• Examined one year of tweets from July 2011 to July 2012

• Restricted to 1.4 billion tweets containing links to top news,videos, images, and petitions sites

• Aggregated tweets by URL, resulting in 1 billion distinct“events”

• Crawled friend list of each adopter

• Inferred “who got what from whom” to construct diffusiontrees

• Characterized size and structure of trees

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 44 / 53

Page 66: Modeling Social Data, Lecture 1: Overview

Data

• Examined one year of tweets from July 2011 to July 2012

• Restricted to 1.4 billion tweets containing links to top news,videos, images, and petitions sites

• Aggregated tweets by URL, resulting in 1 billion distinct“events”

• Crawled friend list of each adopter

• Inferred “who got what from whom” to construct diffusiontrees

• Characterized size and structure of trees

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 44 / 53

Page 67: Modeling Social Data, Lecture 1: Overview

Data

• Examined one year of tweets from July 2011 to July 2012

• Restricted to 1.4 billion tweets containing links to top news,videos, images, and petitions sites

• Aggregated tweets by URL, resulting in 1 billion distinct“events”

• Crawled friend list of each adopter

• Inferred “who got what from whom” to construct diffusiontrees

• Characterized size and structure of trees

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 44 / 53

Page 68: Modeling Social Data, Lecture 1: Overview

Data

• Examined one year of tweets from July 2011 to July 2012

• Restricted to 1.4 billion tweets containing links to top news,videos, images, and petitions sites

• Aggregated tweets by URL, resulting in 1 billion distinct“events”

• Crawled friend list of each adopter

• Inferred “who got what from whom” to construct diffusiontrees

• Characterized size and structure of trees

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 44 / 53

Page 69: Modeling Social Data, Lecture 1: Overview

Data

• Examined one year of tweets from July 2011 to July 2012

• Restricted to 1.4 billion tweets containing links to top news,videos, images, and petitions sites

• Aggregated tweets by URL, resulting in 1 billion distinct“events”

• Crawled friend list of each adopter

• Inferred “who got what from whom” to construct diffusiontrees

• Characterized size and structure of trees

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 44 / 53

Page 70: Modeling Social Data, Lecture 1: Overview

The Structural Virality of Online Diffusion

A

B

D

C

E

Tim

e

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 45 / 53

Page 71: Modeling Social Data, Lecture 1: Overview

Information diffusionCascade size distribution

0.00001%

0.0001%

0.001%

0.01%

0.1%

1%

10%

1 10 100 1,000 10,000

Cascade Size

CC

DF

Focus on the rare hits that get at least 100 adoptions

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 46 / 53

Page 72: Modeling Social Data, Lecture 1: Overview

Quantifying structure

Measure the average distance between all pairs of nodes5

5Weiner (1947); correlated with other possible metricsJake Hofman (Columbia University) Introduction and Overview January 20, 2017 47 / 53

Page 73: Modeling Social Data, Lecture 1: Overview

Information diffusionSize and virality by category

Remarkable structural diversity across across categories

0.001%

0.01%

0.1%

1%

10%

100%

100 1,000 10,000

Cascade Size

CC

DF

Videos

Pictures

News

Petitions

0.001%

0.01%

0.1%

1%

10%

100%

3 10 30

Structural Virality

CC

DF

Videos

Pictures

News

Petitions

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 48 / 53

Page 74: Modeling Social Data, Lecture 1: Overview

Information diffusionStructural diversity

0 50 100 150time

size

0 5 10 15 20time

size

0 20 40 60 80 100 120 140time

size

0 20 40 60 80 100 120time

size

0.0 0.5 1.0 1.5time

size

0 10 20 30 40 50 60 70time

size

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 49 / 53

Page 75: Modeling Social Data, Lecture 1: Overview

Information diffusionStructural diversity

Size is relatively poor predictive of structure

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 50 / 53

Page 76: Modeling Social Data, Lecture 1: Overview

Summary

Popular 6= Viral

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 51 / 53

Page 77: Modeling Social Data, Lecture 1: Overview

Information diffusionSummary

• Most cascades fail, resulting in fewer than two adoptions, onaverage

• Of the hits that do succeed, we observe a wide range ofdiverse diffusion structures

• It’s difficult to say how something spread given only itspopularity

• “The structural virality of online diffusion”, Anderson, Goel,Hofman & Watts (Management Science 2015)

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 52 / 53

Page 78: Modeling Social Data, Lecture 1: Overview

1. Ask good questionsThere’s nothing interesting in the data without them

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 53 / 53

Page 79: Modeling Social Data, Lecture 1: Overview

2. Think before you code5 minutes at the whiteboard is worth an hour at the keyboard

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 53 / 53

Page 80: Modeling Social Data, Lecture 1: Overview

3. Keep the answers simpleExploratory data analysis and linear models go a long way

Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 53 / 53