mining social data for fun and insight

Social Data Mining Toby Segaran

Upload: adunne

Post on 17-Jan-2015

DESCRIPTION

Speaker: Toby Segaran

TRANSCRIPT

Page 1: Mining Social Data for Fun and Insight

Social Data Mining

Toby Segaran

Page 2: Mining Social Data for Fun and Insight

About Me

http://kiwitobes.com

Page 3: Mining Social Data for Fun and Insight

What is data mining?

Implicit

Unknown

Useful

Page 4: Mining Social Data for Fun and Insight

What is data?

Page 5: Mining Social Data for Fun and Insight

Data-mining traditional uses

Page 6: Mining Social Data for Fun and Insight

Why it’s important now

Data

Page 7: Mining Social Data for Fun and Insight

Why it’s important now

Page 8: Mining Social Data for Fun and Insight

Why it’s important now

All products are actually sold on Amazon

Page 9: Mining Social Data for Fun and Insight

Why it’s important now

Facebook Google

Page 10: Mining Social Data for Fun and Insight

Why it’s important now

Page 11: Mining Social Data for Fun and Insight

For Social Insight

Home Prices Blogs and News Movie Data

Fashion Product Prices Hotties

Page 12: Mining Social Data for Fun and Insight

Blogs…

Page 13: Mining Social Data for Fun and Insight

The Technorati Top 100

Page 14: Mining Social Data for Fun and Insight

Getting the content

The

Six

Degrees

Hypothesis

Experienced

It

Is

When

You

Travel

Page 15: Mining Social Data for Fun and Insight

Building a Word Matrix

The

Six

Degrees

Hypothesis

Experienced

It

Is

When

You

Travel

Six

Degrees

Hypothesis

Experienced

Travel

Six 3

Degrees 3

Hypothesis 1

Experienced 5

Travel 6
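The counting step on this slide can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the sample sentence are assumptions, chosen to mirror the words the slide drops ("The", "It", "Is", "When", "You").

```python
import re
from collections import Counter

# Hypothetical stop-word list; the slide only shows these five words removed.
STOP_WORDS = {"the", "it", "is", "when", "you"}

def word_counts(text):
    """Lowercase the text, split on non-letters, drop stop words, count the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

# Made-up sample text, not taken from any real blog.
counts = word_counts("The Six Degrees Hypothesis: experienced it is when you travel. Six degrees!")
print(counts["six"], counts["degrees"], counts["travel"])
```

One row of the word matrix is then just these counts for a fixed list of words, computed per blog.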

Page 16: Mining Social Data for Fun and Insight

The Word Matrix

“china” “kids” “music” “travel” “yahoo”

Gothamist 0 3 3 3

4

0

6

0

GigaOM 6 0 1 2

QuickOnlineTips 0 2 2 12

O’Reilly Radar 1 0 3 4

Page 17: Mining Social Data for Fun and Insight

Determining distance

“china” “kids” “music” “yahoo”

Gothamist 0 3 3 0

GigaOM 6 0 1 2

Quick Online Tips 0 2 2 12

Euclidean “as the crow flies”

$\sqrt{(6-0)^2 + (0-2)^2 + (1-2)^2 + (2-12)^2} = \sqrt{141} \approx 12$
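As a quick check of the arithmetic, here is a small sketch that computes the same distance between the GigaOM and Quick Online Tips rows of the table above. The function name `euclidean` is just an illustrative choice.

```python
import math

def euclidean(a, b):
    """Straight-line ("as the crow flies") distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Word counts for "china", "kids", "music", "yahoo" from the slide.
gigaom = [6, 0, 1, 2]
quick_online_tips = [0, 2, 2, 12]

d = euclidean(gigaom, quick_online_tips)
print(round(d, 2))  # 11.87
```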

Page 18: Mining Social Data for Fun and Insight

Hierarchical Clustering

Find the two closest items

Combine them into a single item

Repeat…
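The three steps can be sketched roughly as follows. The slide does not say how two items are combined; averaging their vectors is an assumption made here, and the function names are illustrative only.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hierarchical_cluster(rows):
    """rows maps item name -> vector; returns a nested tuple describing the merges."""
    # Each cluster is (tree, vector); a leaf uses the item name as its tree.
    clusters = [(name, list(vec)) for name, vec in rows.items()]
    while len(clusters) > 1:
        # Find the two closest clusters.
        pairs = [(euclidean(clusters[i][1], clusters[j][1]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        (tree_i, vec_i), (tree_j, vec_j) = clusters[i], clusters[j]
        # Combine them into a single item (assumption: average the two vectors).
        merged = ((tree_i, tree_j), [(x + y) / 2 for x, y in zip(vec_i, vec_j)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Word counts for "china", "kids", "music", "yahoo" from the distance slide.
blogs = {
    "Gothamist": [0, 3, 3, 0],
    "GigaOM": [6, 0, 1, 2],
    "Quick Online Tips": [0, 2, 2, 12],
}
tree = hierarchical_cluster(blogs)
print(tree)  # ('Quick Online Tips', ('Gothamist', 'GigaOM'))
```

The nested tuple is the structure a dendrogram draws: the innermost pair was merged first.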

Page 19: Mining Social Data for Fun and Insight

Hierarchical Algorithm

Page 24: Mining Social Data for Fun and Insight

Dendrogram

Page 25: Mining Social Data for Fun and Insight

Hierarchical Blog Clusters

Page 28: Mining Social Data for Fun and Insight

Rotating the Matrix

Words in a blog -> blogs containing each word

Gothamist GigaOM Quick Onl

china 0 6 0

kids 3 0 2

music 3 1 2

Yahoo 0 2 12
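Rotating the matrix is a plain transpose. A minimal sketch using the numbers from the slides (the variable names are illustrative):

```python
# Rows are blogs; columns are the words "china", "kids", "music", "yahoo".
by_blog = [
    [0, 3, 3, 0],   # Gothamist
    [6, 0, 1, 2],   # GigaOM
    [0, 2, 2, 12],  # Quick Online Tips
]

# zip(*rows) yields the columns, so each new row is one word across all blogs.
by_word = [list(column) for column in zip(*by_blog)]
print(by_word)  # [[0, 6, 0], [3, 0, 2], [3, 1, 2], [0, 2, 12]]
```

Feeding `by_word` to the same clustering code groups words instead of blogs.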

Page 29: Mining Social Data for Fun and Insight

Hierarchical Word Clusters

Page 30: Mining Social Data for Fun and Insight

K-Means Clustering

Divides data into distinct clusters

User determines how many

Algorithm

Start with arbitrary centroids

Assign points to centroids

Move the centroids

Repeat
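A minimal sketch of those four steps, under a few assumptions: the arbitrary starting centroids are picked from the data points themselves, distance is Euclidean, and the loop stops when assignments no longer change. The sample points are made up.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Start with arbitrary centroids (assumption: k randomly chosen data points).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        new_assignment = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:
            break  # nothing moved, so the clusters are stable
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Made-up points forming two obvious groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centroids = kmeans(points, k=2)
print(assignment)
```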

Page 31: Mining Social Data for Fun and Insight

K-Means Algorithm

Page 36: Mining Social Data for Fun and Insight

K-Means Results

The Viral Garden, Copyblogger, Creating Passionate Users, Oilman, ProBlogger Blog Tips, Seth's Blog

Wonkette, Gawker, Gothamist, Huffington Post

1 2

Page 37: Mining Social Data for Fun and Insight

2D Visualizations

Instead of Clusters, a 2D Map

Goals

Preserve distances as much as possible

Draw in two dimensions

Dimension Reduction

Principal Components Analysis

Multidimensional Scaling
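One way to read "preserve distances as much as possible" is as an optimization: place the items at random in 2D, then nudge them until the 2D distances are close to the original ones. The sketch below does this with plain gradient descent on the squared distance error. The step size, iteration count and function names are assumptions, not something the slides specify.

```python
import math
import random

def stress(pos, dist):
    """Sum of squared differences between 2D distances and target distances."""
    n = len(pos)
    return sum((math.hypot(pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]) - dist[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def mds_2d(dist, iterations=2000, rate=0.05, seed=0):
    rng = random.Random(seed)
    n = len(dist)
    pos = [[rng.random(), rng.random()] for _ in range(n)]  # random starting layout
    for _ in range(iterations):
        grad = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = math.hypot(dx, dy) or 1e-9  # avoid dividing by zero
                err = d - dist[i][j]            # positive means "too far apart"
                grad[i][0] += err * dx / d
                grad[i][1] += err * dy / d
        for i in range(n):
            pos[i][0] -= rate * grad[i][0]
            pos[i][1] -= rate * grad[i][1]
    return pos

# Target distances between Gothamist, GigaOM and Quick Online Tips,
# computed from the word counts on the distance slide.
vectors = [[0, 3, 3, 0], [6, 0, 1, 2], [0, 2, 2, 12]]
dist = [[math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))) for b in vectors] for a in vectors]
layout = mds_2d(dist)
print(stress(layout, dist))
```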

Page 38: Mining Social Data for Fun and Insight

Multidimensional Scaling

Page 46: Mining Social Data for Fun and Insight

Zillow

Page 47: Mining Social Data for Fun and Insight

The Zillow API

Allows querying by address

Returns information about the property

Bedrooms, Bathrooms, Zip Code, Price Estimate, Last Sale Price

Page 48: Mining Social Data for Fun and Insight

A home price dataset

House Zip Bathrooms Bedrooms Built Type Price

A 02138 1.5 2 1847 Single 505296

B 02139 3.5 9 1916 Triplex 776378

C 02140 3.5 4 1894 Duplex 595027

D 02139 2.5 4 1854 Duplex 552213

E 02138 3.5 5 1909 Duplex 947528

F 02138 3.5 4 1930 Single 2107871

etc.

Page 49: Mining Social Data for Fun and Insight

What can we learn?

A made-up house's price

How important is Zip Code?

What are the important attributes?

Can we do better than averages?

Page 50: Mining Social Data for Fun and Insight

Introducing Regression Trees

A B Value

10 Circle 20

11 Square 22

22 Square 8

18 Circle 6

Page 52: Mining Social Data for Fun and Insight

Minimizing deviation

Standard deviation is the “spread” of results

Try all possible divisions

Choose the division that decreases deviation the most

A B Value

10 Circle 20

11 Square 22

22 Square 8

18 Circle 6

Initially

Average = 14

Standard Deviation = 8.2

Page 53: Mining Social Data for Fun and Insight

Minimizing deviation

Standard deviation is the “spread” of results

Try all possible divisions

Choose the division that decreases deviation the most

A B Value

10 Circle 20

11 Square 22

22 Square 8

18 Circle 6

B = Circle

Average = 13

Standard Deviation = 9.9

B = Square

Average = 15

Standard Deviation = 9.9

Page 54: Mining Social Data for Fun and Insight

Minimizing deviation

Standard deviation is the “spread” of results

Try all possible divisions

Choose the division that decreases deviation the most

A B Value

10 Circle 20

11 Square 22

22 Square 8

18 Circle 6

A > 18

Average = 8

Standard Deviation = 0

A <= 18

Average = 16

Standard Deviation = 8.7

Page 55: Mining Social Data for Fun and Insight

Minimizing deviation

Standard deviation is the “spread” of results

Try all possible divisions

Choose the division that decreases deviation the most

A B Value

10 Circle 20

11 Square 22

22 Square 8

18 Circle 6

A > 11

Average = 7

Standard Deviation = 1.4

A <= 11

Average = 21

Standard Deviation = 1.4
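The numbers on these slides can be reproduced with a short sketch. It uses the sample standard deviation (which matches the 8.2, 9.9, 8.7 and 1.4 shown) and scores each division by the size-weighted average of the two groups' deviations. That weighting is an assumption, since the slides only say to pick the division that decreases deviation the most, and only the three divisions shown on the slides are tried here.

```python
from statistics import stdev

# Rows from the slide: (A, B, Value)
rows = [(10, "Circle", 20), (11, "Square", 22), (22, "Square", 8), (18, "Circle", 6)]

def spread(values):
    """Sample standard deviation; a single value has no spread."""
    return stdev(values) if len(values) > 1 else 0.0

def split_score(rows, test):
    """Size-weighted average spread of the two groups a test produces (lower is better)."""
    yes = [r[2] for r in rows if test(r)]
    no = [r[2] for r in rows if not test(r)]
    return sum(len(g) / len(rows) * spread(g) for g in (yes, no) if g)

# Only the divisions shown on the slides; a full search would try every value.
candidates = {
    "B = Circle": lambda r: r[1] == "Circle",
    "A > 18": lambda r: r[0] > 18,
    "A > 11": lambda r: r[0] > 11,
}
scores = {name: split_score(rows, test) for name, test in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # A > 11
```

Applying the same search again inside each resulting group is what grows the tree.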

Page 56: Mining Social Data for Fun and Insight

CART Algorithm

A B Value

10 Circle 20

11 Square 22

22 Square 8

18 Circle 6

Page 58: Mining Social Data for Fun and Insight

CART Algorithm

10 Circle 20

11 Square 22

22 Square 8

18 Circle 6

Page 59: Mining Social Data for Fun and Insight

CART Algorithm

Page 60: Mining Social Data for Fun and Insight

Zillow Results

Bathrooms > 3

Zip: 02139? After 1903?

Triplex? Duplex? Bedrooms > 4? Zip: 02140?

Page 61: Mining Social Data for Fun and Insight

Just for Fun… Hot or Not

Page 62: Mining Social Data for Fun and Insight

Variance dividers

[Figure: charts with a 0 to 9 axis; groups: Northeast, South, Male, Female; panels: Low Variance Split, High Variance Split]

Page 63: Mining Social Data for Fun and Insight

Just for Fun… Hot or Not

Page 64: Mining Social Data for Fun and Insight

Supervised and Unsupervised

Clustering methods are unsupervised

There are no answers

Methods just characterize the data

Show interesting patterns

Regression Trees are supervised

“answers” are in the dataset

Tree models predict answers

Page 65: Mining Social Data for Fun and Insight

Personal Ads

Page 66: Mining Social Data for Fun and Insight

The Analysis

Five Cities

W4M Personal Ads

Page 67: Mining Social Data for Fun and Insight

Bayesian filter

If you listen to NPR, watch Hardball, and love the Red Sox, you may be the guy for me.

Please email me back.

I'm a professional with a grad school degree who has a sense of humor and loves the Sox.

Sox 0.4

Red 0.35

Grad 0.2

Professional 0.1

Humor 0.1

Boston

Page 68: Mining Social Data for Fun and Insight

Bayesian filter

P( C | W ) = P (C & W) / P (W)

How often do the word and the city appear together?

How often does the word appear overall…

Rank these, and you have a list of the words most particular to a given city
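A small sketch of that ranking, with both probabilities estimated by counting ads: P(C & W) is the share of ads that are from the city and contain the word, and P(W) is the share of ads that contain the word. The ads below are invented for illustration, and the function names are not from the talk.

```python
from collections import Counter, defaultdict

def city_word_scores(ads):
    """ads is a list of (city, text); returns {city: {word: P(city | word)}}."""
    word_total = Counter()           # number of ads containing each word
    word_city = defaultdict(Counter)  # same count, broken down by city
    for city, text in ads:
        for word in set(text.lower().split()):
            word_total[word] += 1
            word_city[city][word] += 1
    # P(C | W) = P(C & W) / P(W); the total number of ads cancels out.
    return {city: {w: n / word_total[w] for w, n in counts.items()}
            for city, counts in word_city.items()}

def top_words(scores, city, n=3):
    """Words most particular to a city, highest P(city | word) first."""
    ranked = sorted(scores[city].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

# Invented ads, for illustration only.
ads = [
    ("Boston", "love the sox"),
    ("Boston", "sox fan"),
    ("Chicago", "cubs fan"),
    ("Chicago", "love the cubs and sox"),
]
scores = city_word_scores(ads)
print(top_words(scores, "Boston", 1))  # ['sox']
```

On real data a word that appears only once scores 1.0 for its city, so some minimum count would probably be needed before ranking.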

Page 69: Mining Social Data for Fun and Insight

Results

New York

Mets

Lounges

Offense

Desires

Musical

Submissive

Create

Song

Oral

Boston

Pink

Sox

Poetry

Intellectually

Punk

Appreciation

Exercise

Winter

Education

Chicago

Cubs

Burbs

Bears

Girlie

Insecure

Cheat

Importance

Blunt

Mouth

Page 70: Mining Social Data for Fun and Insight

Results

Los Angeles

Excellent

Vegas

Meaningful

Star

Lame

Industry

Heat

Fitness

Entertainment

Latino

San Francisco

Tee

Employment

Picnic

STD

Tasting

Hikes

French

.com

Kayaking

Cycling

Page 71: Mining Social Data for Fun and Insight

Newsgroup Discussion

Page 72: Mining Social Data for Fun and Insight

Overlapping themes

Page 73: Mining Social Data for Fun and Insight

Themes in a document

Page 74: Mining Social Data for Fun and Insight

Another word matrix

Msg1 Msg2 Msg3 Msg4 Msg5

Gym 2 0 0 3 0

Calorie 0 2 1 1 3

Weigh 1 0 2 0 0

Carbs 0 3 0 0 2

Treadmill 1 0 0 2 0

Actual Matrix

Page 75: Mining Social Data for Fun and Insight

Weights and features

Msg1 M2 M3 M4 M5

F1 1 0 2 3 0

F2 0 2 1 1 3

F3 1 0 2 0 0

F1 F2 F3

Gym 0 1 2

Calorie 2 0 1

Weigh 2 2 1

Carbs 1 0 3

Treadmill 0 1 2

Features Matrix

Weight Matrix

x

Page 76: Mining Social Data for Fun and Insight

Matrix factorization

Msg1 Msg2 Msg3 Msg4 Msg5

Gym 1 3 3 0 1

Calorie 0 2 4 1 3

Weigh 2 3 1 0 1

Carbs 0 1 1 0 2

Treadmill 3 2 0 2 2

Msg1 M2 M3 M4 M5

F1 1 0 2 3 0

F2 0 2 1 1 3

F3 1 0 2 0 0

F1 F2 F3

Gym 0 1 2

Calorie 2 0 1

Weigh 2 2 1

Carbs 1 0 3

Treadmill 0 1 2

Features Matrix

Weight Matrix

x

Current Guess

Page 77: Mining Social Data for Fun and Insight

Matrix factorization

Msg1 M2 M3 M4 M5

F1 1 0 2 3 0

F2 0 2 1 1 3

F3 1 0 2 0 0

F1 F2 F3

Gym 0 1 2

Calorie 2 0 1

Weigh 2 2 1

Carbs 1 0 3

Treadmill 0 1 2

Features Matrix

Weight Matrix

x

Msg1 Msg2 Msg3 Msg4 Msg5

Gym 2 0 0 3 0

Calorie 0 2 1 1 3

Weigh 1 0 2 0 0

Carbs 0 3 0 0 2

Treadmill 1 0 0 2 0

Target Result

Msg1 Msg2 Msg3 Msg4 Msg5

Gym 1 3 3 0 1

Calorie 0 2 4 1 3

Weigh 2 3 1 0 1

Carbs 0 1 1 0 2

Treadmill 3 2 0 2 2

Current Guess

Page 78: Mining Social Data for Fun and Insight

Matrix factorization

Msg1 M2 M3 M4 M5

F1 2 0 0 1 0

F2 0 2 0 1 3

F3 1 0 1 0 0

F1 F2 F3

Gym 1 0 0

Calorie 0 1 1

Weigh 0 0 2

Carbs 0 1 0

Treadmill 1 0 0

Features Matrix

Weight Matrix

x

Msg1 Msg2 Msg3 Msg4 Msg5

Gym 2 0 0 3 0

Calorie 0 2 1 1 3

Weigh 1 0 2 0 0

Carbs 0 3 0 0 2

Treadmill 1 0 0 2 0

Target Result

Msg1 Msg2 Msg3 Msg4 Msg5

Gym 2 0 0 3 0

Calorie 0 2 1 1 3

Weigh 1 0 2 0 0

Carbs 0 3 0 0 2

Treadmill 1 0 0 2 0

Current Guess
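The slides show a guess being improved until it matches the target, without saying how the two smaller matrices are updated. One common choice, assumed here, is the multiplicative update rule for non-negative matrix factorization, sketched below in plain Python. The matrix names `w` (words by features) and `h` (features by messages) and all parameter values are illustrative.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def cost(v, guess):
    """Sum of squared differences between the target and the current guess."""
    return sum((x - y) ** 2 for rv, rg in zip(v, guess) for x, y in zip(rv, rg))

def factorize(v, k, iterations=200, seed=0):
    """Approximate v (m x n) as w (m x k) times h (k x n), all entries non-negative."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() for _ in range(k)] for _ in range(m)]
    h = [[rng.random() for _ in range(n)] for _ in range(k)]
    eps = 1e-9  # keeps the divisions safe when a denominator is zero
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)] for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(m)]
    return w, h

# Target matrix from the slide: rows Gym, Calorie, Weigh, Carbs, Treadmill; columns Msg1..Msg5.
target = [
    [2, 0, 0, 3, 0],
    [0, 2, 1, 1, 3],
    [1, 0, 2, 0, 0],
    [0, 3, 0, 0, 2],
    [1, 0, 0, 2, 0],
]
w, h = factorize(target, k=3)
print(cost(target, matmul(w, h)))
```

Because every update multiplies by a non-negative ratio, the entries never go negative, which is what lets each feature be read as a set of words that occur together.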

Page 79: Mining Social Data for Fun and Insight

Interpreting Features

Msg1 M2 M3 M4 M5

F1 2 0 0 1 0

F2 0 2 0 1 3

F3 1 0 1 0 0

F1 F2 F3

Gym 1 0 0

Calorie 0 1 1

Weigh 0 0 2

Carbs 0 1 0

Treadmill 1 0 0

Features Matrix

Weight Matrix

Theme 1 Theme 2 Theme 3

Gym Calorie Weigh

Treadmill Carbs Calorie

Msg1 Msg2 Msg3 etc.

Theme 1 Theme 2 Theme 3

Theme 3

Page 80: Mining Social Data for Fun and Insight

“Diet and body” themes

Atkins, Induction, South, Beach, Carbs

Chocolate, Black, Coffee, Olive, Broccoli

Gym, Weights, Exercise, Running, Injured

Cook, Recipe, Fried, Home

Money, Organic, Want, Best

Calories, Weight, Fats, Protein, Cholesterol

Page 81: Mining Social Data for Fun and Insight

Side note: NMF for faces

Page 82: Mining Social Data for Fun and Insight

Methods covered

Regression trees

Hierarchical clustering

k-means clustering

Multidimensional scaling

Bayesian Classifier

Non-negative Matrix Factorization

Page 83: Mining Social Data for Fun and Insight

Other ideas

Finance

Analysts already drowning in info

Stories sometimes broken on blogs

Message boards show sentiment

Extremely low signal-to-noise ratio

Page 84: Mining Social Data for Fun and Insight

Other ideas

Product problems/ideas

Use support message boards

Extract themes

Understand recurring issues

Learn what features people want

Page 85: Mining Social Data for Fun and Insight

Other ideas

Entertainment

How much buzz is a movie generating?

What psychographic profiles like this type of movie?

Of interest to studios and media investors