Social Data
Toby Segaran
Author, Programming Collective Intelligence
Data Magnate, Metaweb Technologies
Data mining?
“Sorting through data* to identify patterns and establish
relationships”
* usually a lot of data
Where and why?
Methods and examples
Where and why?
•Targeted Advertising
•Recommendations
•Search Results
•Group Discovery
•Filtering of Documents
•Theme Extraction
Google ad
Facebook ad
This is strange...
•Google just has text
•Facebook knows more about me
•But it’s taking a few cues...
Status: “engaged”
Where and why?
•Targeted Advertising
•Recommendations
•Search Results
•Group Discovery
•Filtering of Documents
•Theme Extraction
Real Amazon Products
Netflix Prize
Strands Contest
Custom News
Where and why?
•Targeted Advertising
•Recommendations
•Search Results
•Group Discovery
•Filtering of Documents
•Theme Extraction
Ranking algorithms
The now-incredibly-famous paper
Ranking algorithms
•Google begins tracking clicks in 2005
•MSN search claims neural network
•AOL Data Scandal
Learning behavior
Where and why?
•Targeted Advertising
•Recommendations
•Search Results
•Group Discovery
•Filtering of Documents
•Theme Extraction
In Biology
Page Grouping
Resumes
“Can resumes be grouped into career paths?”
Where and why?
•Targeted Advertising
•Recommendations
•Search Results
•Group Discovery
•Filtering of Documents
•Theme Extraction
The obvious: spam
SpamBayes
Other email uses
Web documents
“As you add information to Twine, it is automatically tagged so that you and others
can find it more easily”
Where and why?
•Targeted Advertising
•Recommendations
•Search Results
•Group Discovery
•Filtering of Documents
•Theme Extraction
What is the buzz?
Customer Community
Where and why?
Methods and examples
Methods and Examples
•Bayesian Filtering
•Distance Metrics
•Clustering
•Decision Trees
•Network Analysis
•Feature Extraction
Bayesian Filtering
schoolwork
algorithm
v1agra
trades
associate
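The word lists above suggest how a filter learns which words point to which category. A minimal sketch of a simplified naive Bayes filter follows; the class name, the training messages, and the add-one smoothing are assumptions made for illustration, not taken from the talk or from SpamBayes.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Simplified naive Bayes text filter (illustrative sketch only)."""

    def __init__(self):
        # Number of training documents per label that contain each word.
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        # Number of training documents per label.
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(set(text.lower().split()))

    def _log_score(self, words, label):
        total_docs = sum(self.doc_counts.values())
        # Start from the prior probability of the label.
        score = math.log(self.doc_counts[label] / total_docs)
        for word in words:
            # Add-one smoothing so unseen words never give probability zero.
            p = (self.word_counts[label][word] + 1) / (self.doc_counts[label] + 2)
            score += math.log(p)
        return score

    def classify(self, text):
        words = set(text.lower().split())
        return max(("spam", "ham"), key=lambda label: self._log_score(words, label))


# Invented training messages built around the words on the slide.
spam_filter = NaiveBayesFilter()
spam_filter.train("v1agra trades cheap", "spam")
spam_filter.train("associate trades offer", "spam")
spam_filter.train("schoolwork algorithm notes", "ham")
spam_filter.train("algorithm lecture schoolwork", "ham")

print(spam_filter.classify("new algorithm for schoolwork"))
print(spam_filter.classify("v1agra trades"))
```

The sketch scores only the words present in a message, which keeps it short; a production filter would use far more training data and a more careful probability model.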
Craigslist personals
Analysis
Five Cities
W4M Personal Ads
Results
New York: Mets, Lounges, Offense, Desires, Musical, Submissive, Create, Song, Oral
Boston: Pink, Sox, Poetry, Intellectually, Punk, Appreciation, Exercise, Winter, Education
Chicago: Cubs, Burbs, Bears, Girlie, Insecure, Cheat, Importance, Blunt, Mouth
Los Angeles: Excellent, Vegas, Meaningful, Star, Lame, Industry, Heat, Fitness, Entertainment, Latino
San Francisco: Tee, Employment, Picnic, STD, Tasting, Hikes, French, .com, Kayaking, Cycling
Methods and Examples
•Bayesian Filtering
•Distance Metrics
•Clustering
•Decision Trees
•Network Analysis
•Feature Extraction
Preference distance
Sarah Marshall  Leatherheads
3               3
2               3
1               5
2               5
Preference distance
[Plot: each person's two ratings as a point on 1-5 axes; distances marked between points: 1 and 2.23]
For recommendations
[Plot: the same points on 1-5 axes; distances marked: 1 and 2.23]
Neighbors' ratings: Prom Night: 5, Prom Night: 2
Estimated rating: Prom Night: 4.1
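One way to arrive at the 4.1 on the slide is to weight each neighbor's Prom Night rating by the inverse of its distance. The sketch below assumes that weighting scheme, and it assumes the point coordinates (chosen so the distances come out as 1 and 2.23); neither is stated on the slides, and `predict_rating` is a hypothetical helper name.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two rating points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_rating(target, neighbors):
    """Inverse-distance-weighted average of the neighbors' ratings.

    neighbors is a list of (ratings_point, rating_of_unseen_movie).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for point, rating in neighbors:
        d = euclidean(target, point)
        if d == 0:
            # Identical taste: just reuse that person's rating.
            return float(rating)
        weight = 1.0 / d
        weighted_sum += weight * rating
        weight_total += weight
    return weighted_sum / weight_total


# Assumed coordinates: (Sarah Marshall rating, Leatherheads rating).
me = (2, 3)
neighbors = [
    ((3, 3), 5),  # distance 1, rated Prom Night 5
    ((1, 5), 2),  # distance about 2.23, rated Prom Night 2
]

estimate = predict_rating(me, neighbors)
print(round(estimate, 1))
```

The closer neighbor dominates the estimate, which is why the result sits nearer to 5 than to 2.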
Linguistic distance
The
Six
Degrees
Hypothesis
Experienced
It
Is
When
You
Travel
Six
Degrees
Hypothesis
Experienced
Travel
Six 3
Degrees 3
Hypothesis 1
Experienced 5
Travel 6
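The steps above (split into words, drop the common ones, count what is left) can be sketched as follows. The stop-word set is an assumption: it holds only the five words that disappear between the slides. The counts on the slide come from a longer text that is not in the transcript, so this example counts the ten slide words only.

```python
import re
from collections import Counter

# Assumed stop-word list: the words dropped on the slide.
STOP_WORDS = {"the", "it", "is", "when", "you"}


def word_counts(text):
    """Lowercase the text, split it into words, and count the non-stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


words_on_slide = "The Six Degrees Hypothesis Experienced It Is When You Travel"
counts = word_counts(words_on_slide)
print(sorted(counts.items()))
```

Run over a whole article or blog feed, the same function yields a word-count vector like the rows in the tables that follow.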
Linguistic distance
                   “china”  “kids”  “music”  “travel”  “yahoo”
Gothamist             0       3        3        3         0
GigaOM                6       0        1        4         2
Quick Online Tips     0       2        2        0        12
O’Reilly Radar        1       0        3        6         4
Linguistic distance
                   “china”  “kids”  “music”  “yahoo”
Gothamist             0       3        3        0
GigaOM                6       0        1        2
Quick Online Tips     0       2        2       12
Euclidean “as the crow flies”
= 12 (approx)
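As a rough check of the "as the crow flies" figure, the sketch below computes the Euclidean distance between every pair of blogs in the four-word table. It uses `math.dist`, which requires Python 3.8 or later. The slide does not say which pair the 12 refers to, so the code prints all three pairs.

```python
import math
from itertools import combinations

WORDS = ("china", "kids", "music", "yahoo")
COUNTS = {
    "Gothamist": (0, 3, 3, 0),
    "GigaOM": (6, 0, 1, 2),
    "Quick Online Tips": (0, 2, 2, 12),
}


def blog_distances(counts):
    """Euclidean distance between the word-count vectors of each pair of blogs."""
    return {
        (a, b): math.dist(counts[a], counts[b])
        for a, b in combinations(counts, 2)
    }


distances = blog_distances(COUNTS)
for (a, b), d in distances.items():
    print(f"{a} - {b}: {d:.2f}")
```

Both pairs that include Quick Online Tips come out near 12, driven almost entirely by its count of 12 for “yahoo”.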
Article/blog similarity
Valleywag - Huffington > Slashdot - Wired
Methods and Examples
•Bayesian Filtering
•Distance Metrics
•Clustering
•Decision Trees
•Network Analysis
•Feature Extraction
Hierarchical Clustering
[Figure: points plotted on 1-5 axes, shown across four successive slides]
Hierarchical Clustering
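A minimal sketch of bottom-up hierarchical clustering: start with every point as its own cluster, then repeatedly merge the two closest clusters until one remains. The points and their names are invented, and measuring closeness between cluster centroids is one choice among several; the slides do not say which linkage was used.

```python
import math


def centroid(members):
    """Average position of a list of points."""
    n = len(members)
    return tuple(sum(axis) / n for axis in zip(*members))


def hierarchical_cluster(points):
    """Return a nested tuple describing the order in which clusters merged.

    points maps a name to its coordinates.
    """
    # Each cluster is (member coordinates, tree of names).
    clusters = [([coords], name) for name, coords in points.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i][0]), centroid(clusters[j][0]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = (clusters[i][0] + clusters[j][0], (clusters[i][1], clusters[j][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][1]


# Invented points on the same 1-5 scale as the slides.
POINTS = {"a": (1, 1), "b": (1, 2), "c": (4, 4), "d": (5, 4), "e": (5, 5)}

tree = hierarchical_cluster(POINTS)
print(tree)
```

The nested tuple is the textual equivalent of a dendrogram: the innermost pairs merged first, the outermost pair last.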
Grouping bloggers
Grouping articles
Methods and Examples
•Bayesian Filtering
•Distance Metrics
•Clustering
•Decision Trees
•Network Analysis
•Feature Extraction
Decision Trees
CART Algorithm
From any dataset...
Brand      Type  Life (hrs)
Duracell   C     4
Energizer  C     5
Duracell   AA    2
Energizer  AA    2.2
... find the best split ...
Type is C?
  Yes: Avg=4.5
  No:  Avg=2.1
... and repeat.
Type is C?
  Yes: Duracell?
    Yes: 4
    No:  5
  No: Duracell?
    Yes: 2
    No:  2.2
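The "find the best split" step can be sketched as trying every column/value test and keeping the one that leaves the two groups with the least spread. The scoring rule here (sum of squared errors around each group's mean) is an assumption, as are the function names; the slides show only the resulting averages. The data uses 2.2 hours for the Energizer AA, which is the value consistent with Avg=2.1.

```python
ROWS = [
    ("Duracell", "C", 4.0),
    ("Energizer", "C", 5.0),
    ("Duracell", "AA", 2.0),
    ("Energizer", "AA", 2.2),
]
COLUMNS = ("Brand", "Type")


def mean(values):
    return sum(values) / len(values)


def squared_error(values):
    """Sum of squared differences from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows):
    """Return (score, column, value, yes_average, no_average) for the best test."""
    best = None
    for col, name in enumerate(COLUMNS):
        for value in sorted({row[col] for row in rows}):
            yes = [row[-1] for row in rows if row[col] == value]
            no = [row[-1] for row in rows if row[col] != value]
            if not yes or not no:
                continue  # a test that does not divide the data is useless
            score = squared_error(yes) + squared_error(no)
            if best is None or score < best[0]:
                best = (score, name, value, mean(yes), mean(no))
    return best


score, column, value, yes_avg, no_avg = best_split(ROWS)
print(column, value, round(yes_avg, 2), round(no_avg, 2))
```

With only two battery types, "Type is AA?" and "Type is C?" describe the same partition, so the sketch may report either one. The "and repeat" step would call `best_split` again on each group.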
Hot or Not
Methods and Examples
•Bayesian Filtering
•Distance Metrics
•Clustering
•Decision Trees
•Network Analysis
•Feature Extraction
A network
[Figure: graph with nodes A, B, C, D, E, F]
PageRank
[Figure: the same graph; every node starts with a score of 1.0]
D = 0.15 + .85*E/1 + .85 * F/2 + .85*B/1 = 2.275
PageRank
[Figure sequence: node scores after successive updates]
A     B     C     D      E    F
0.58  0.58  1.0   2.275  1.0  0.15
0.58  0.58  2.08  1.56   0.3  0.15
1.03  1.03  1.48  1.56   0.3  0.15
0.78  0.78  1.48  1.34   0.3  0.15
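The update rule in the formula above can be sketched as a short loop. Only the links into D (from B, E, and F, with F having two outgoing links) come from the slide's formula; every other link in `LINKS` is an assumption made so that the example runs. The slides appear to update one node at a time, while this sketch recomputes all scores in each pass.

```python
DAMPING = 0.85

# Outgoing links per page. B->D, E->D, F->D follow the slide's formula;
# the remaining links are assumed for illustration.
LINKS = {
    "A": [],
    "B": ["D"],
    "C": ["A", "B"],
    "D": ["C"],
    "E": ["D"],
    "F": ["D", "E"],
}


def pagerank_step(scores, links):
    """One pass of: score = 0.15 + 0.85 * sum(score of linker / its link count)."""
    new_scores = {}
    for page in links:
        inbound = sum(
            scores[src] / len(out)
            for src, out in links.items()
            if page in out
        )
        new_scores[page] = (1 - DAMPING) + DAMPING * inbound
    return new_scores


def pagerank(links, iterations=20):
    """Start every page at 1.0 and apply the update repeatedly."""
    scores = {page: 1.0 for page in links}
    for _ in range(iterations):
        scores = pagerank_step(scores, links)
    return scores


first_pass = pagerank_step({page: 1.0 for page in LINKS}, LINKS)
print(round(first_pass["D"], 3))

final = pagerank(LINKS)
print({page: round(score, 2) for page, score in final.items()})
```

A page that nothing links to, like F here, settles at the minimum score of 0.15.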
CI FOO participants
Science papers
The paper attempts to provide an alternative method for measuring the importance of scientific
papers based on the Google's PageRank. The method is a meaningful extension of the common
integer counting of citations and is then experimented for bringing PageRank to the
citation analysis in a large citation network. It offers a more integrated picture of the
publications' influence in a specific field.
Bringing PageRank to the citation analysis
Clustering coefficient
“How many of each person’s friends are friends with each other?”
Clustering coefficient
[Figure: graph with nodes A, B, C, D, E, F]
Low clustering coefficient
Clustering coefficient
[Figure: graph with nodes A, B, C, D, E, F]
High clustering coefficient: “small world graph”
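The question on the slide translates directly into code: for one person, count how many pairs of their friends are themselves friends, and divide by the number of possible pairs. Both example graphs below are invented; they are not the graphs from the figures.

```python
from itertools import combinations


def clustering_coefficient(graph, node):
    """Fraction of a node's friend pairs that are also friends with each other.

    graph maps each node to the set of its neighbors (undirected).
    """
    friends = graph[node]
    if len(friends) < 2:
        return 0.0  # no pairs of friends to compare
    pairs = list(combinations(sorted(friends), 2))
    linked = sum(1 for a, b in pairs if b in graph[a])
    return linked / len(pairs)


# Invented example: A's friends do not know each other.
star = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A"},
}

# Invented example: all of A's friends know each other.
clique = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C"},
}

print(clustering_coefficient(star, "A"))
print(clustering_coefficient(clique, "A"))
```

Averaging this value over every node gives a single number for the whole network.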
Twitter!
Methods and Examples
•Bayesian Filtering
•Distance Metrics
•Clustering
•Decision Trees
•Network Analysis
•Feature Extraction
Independent Features
Message boards
Matrix Factorization
Features Matrix x Weight Matrix = Current Guess

Features Matrix
           F1  F2  F3
Gym         0   1   2
Calorie     2   0   1
Weigh       2   2   1
Carbs       1   0   3
Treadmill   0   1   2

Weight Matrix
    Msg1  Msg2  Msg3  Msg4  Msg5
F1     1     0     2     3     0
F2     0     2     1     1     3
F3     1     0     2     0     0

Current Guess
           Msg1  Msg2  Msg3  Msg4  Msg5
Gym           1     3     3     0     1
Calorie       0     2     4     1     3
Weigh         2     3     1     0     1
Carbs         0     1     1     0     2
Treadmill     3     2     0     2     2

Target Result
           Msg1  Msg2  Msg3  Msg4  Msg5
Gym           2     0     0     3     0
Calorie       0     2     1     1     3
Weigh         1     0     2     0     0
Carbs         0     3     0     0     2
Treadmill     1     0     0     2     0

Matrix Factorization
Features Matrix x Weight Matrix = Current Guess

Features Matrix
           F1  F2  F3
Gym         1   0   0
Calorie     0   1   1
Weigh       0   0   2
Carbs       0   1   0
Treadmill   1   0   0

Weight Matrix
    Msg1  Msg2  Msg3  Msg4  Msg5
F1     2     0     0     1     0
F2     0     2     0     1     3
F3     1     0     1     0     0

Current Guess
           Msg1  Msg2  Msg3  Msg4  Msg5
Gym           2     0     0     3     0
Calorie       0     2     1     1     3
Weigh         1     0     2     0     0
Carbs         0     3     0     0     2
Treadmill     1     0     0     2     0

Target Result
           Msg1  Msg2  Msg3  Msg4  Msg5
Gym           2     0     0     3     0
Calorie       0     2     1     1     3
Weigh         1     0     2     0     0
Carbs         0     3     0     0     2
Treadmill     1     0     0     2     0
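The slides show a guess being adjusted until Features Matrix x Weight Matrix approaches the target. One common way to do that adjustment is non-negative matrix factorization with multiplicative updates, sketched below in plain Python. The choice of update rule, the random starting values, and the function names are assumptions; the slides do not say which method produced their numbers.

```python
import random

# Target word counts from the slide (rows: words, columns: messages).
TARGET = [
    [2, 0, 0, 3, 0],  # Gym
    [0, 2, 1, 1, 3],  # Calorie
    [1, 0, 2, 0, 0],  # Weigh
    [0, 3, 0, 0, 2],  # Carbs
    [1, 0, 0, 2, 0],  # Treadmill
]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def cost(target, guess):
    """Sum of squared differences between target and current guess."""
    return sum((t - g) ** 2 for tr, gr in zip(target, guess) for t, g in zip(tr, gr))


def scale(matrix, numerator, denominator, eps=1e-9):
    """Element-wise matrix * numerator / denominator."""
    return [
        [m * n / (d + eps) for m, n, d in zip(mr, nr, dr)]
        for mr, nr, dr in zip(matrix, numerator, denominator)
    ]


def factorize(target, n_features=3, iterations=200, seed=0):
    """Return (features, weights) with features x weights close to target."""
    rng = random.Random(seed)
    rows, cols = len(target), len(target[0])
    features = [[rng.random() + 0.1 for _ in range(n_features)] for _ in range(rows)]
    weights = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(n_features)]
    for _ in range(iterations):
        # Update the weight matrix while holding the features fixed.
        ft = transpose(features)
        weights = scale(weights, matmul(ft, target),
                        matmul(matmul(ft, features), weights))
        # Update the features matrix while holding the weights fixed.
        wt = transpose(weights)
        features = scale(features, matmul(target, wt),
                         matmul(features, matmul(weights, wt)))
    return features, weights


features, weights = factorize(TARGET)
print(round(cost(TARGET, matmul(features, weights)), 3))
```

Because every update multiplies by a non-negative ratio, the entries never go negative, which is what makes the resulting features readable as themes.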
Interpreting Features

Features Matrix
           F1  F2  F3
Gym         1   0   0
Calorie     0   1   1
Weigh       0   0   2
Carbs       0   1   0
Treadmill   1   0   0

Weight Matrix
    Msg1  Msg2  Msg3  Msg4  Msg5
F1     2     0     0     1     0
F2     0     2     0     1     3
F3     1     0     1     0     0

Features Matrix: top words per theme
Theme 1: Gym, Treadmill
Theme 2: Calorie, Carbs
Theme 3: Weigh, Calorie

Weight Matrix: themes per message
Msg1: Theme 1, Theme 3
Msg2: Theme 2
Msg3: Theme 3
etc.
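Reading themes off the two matrices is a matter of sorting: the largest entries in a feature's column of the features matrix name the theme, and the largest entries in a message's column of the weight matrix say which themes it contains. A small sketch using the final matrices from the slides; the function names are hypothetical.

```python
WORDS = ["Gym", "Calorie", "Weigh", "Carbs", "Treadmill"]

# Features matrix from the slide (rows: words, columns: F1..F3).
FEATURES = [
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 2],
    [0, 1, 0],
    [1, 0, 0],
]

# Weight matrix from the slide (rows: F1..F3, columns: Msg1..Msg5).
WEIGHTS = [
    [2, 0, 0, 1, 0],
    [0, 2, 0, 1, 3],
    [1, 0, 1, 0, 0],
]


def top_words(features, words, theme, limit=2):
    """Words with the largest non-zero value for one theme, strongest first."""
    ranked = sorted(
        ((row[theme], word) for row, word in zip(features, words) if row[theme] > 0),
        key=lambda pair: -pair[0],
    )
    return [word for _, word in ranked[:limit]]


def themes_for_message(weights, message):
    """Themes with a non-zero weight for one message, strongest first."""
    ranked = sorted(
        ((row[message], index) for index, row in enumerate(weights) if row[message] > 0),
        key=lambda pair: -pair[0],
    )
    return [f"Theme {index + 1}" for _, index in ranked]


for theme in range(3):
    print(f"Theme {theme + 1}:", top_words(FEATURES, WORDS, theme))
for message in range(3):
    print(f"Msg{message + 1}:", themes_for_message(WEIGHTS, message))
```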
Diet & Body themes
Atkins Induction South Beach Carbs
Chocolate Black Coffee Olive Broccoli
Gym Weights Exercise Running Injured
Cook Recipe Fried Home Money
Organic Want Best
Calories Weight Fats Protein Cholesterol
Wikipedia people
she her after when father women
series television show which radio bbc
league major baseball season played with
olympics competed won summer medal athlete
university professor received science research born
We’re just getting started...
Homepage http://kiwitobes.com
Freebase http://freebase.com
Questions?