Diving In The Deep End of the Big Data Pool
François Garillot (@huitseeker)
17:45 Thursday
Understanding your Unicorns: Data Science Team Building in Action
Location: 120-121
4 analytical PhDs
3 weeks
1 org with data
& a QUESTION
François Garillot (me)
Stephen Gadd
Marisa Figueiredo
Federica Capranico
Globetrotters
Family
Entertainment
SMB
Sport
Music Festivals
Football Fans
In Car Market Buyers
Pet Owners
Technology
Drivers
Mums Preschool
University Students
Gamblers
Mums
Shoppers
Music
Zone 1 commuters Infrequent
Zone 1 commuters Freq.
Zone 1 commuters Resident
Zone 1 commuters Regular
Zone 1 commuters
Entertainment Films
Food Coffee Shops
Gamers
Autos
B2B
Business/Finance
Careers
Education
Entertainment
Family & Youth
Gambling
Gaming
IT
Lifestyle
News
Property
Government
Retail
Search
Social
Sport
Telco
Travel
5+ millions
50+ K
... so: Things Not To Mess Up
Nobody ever gets those two right
unsupervised clustering
find new segments
based on web
browsing history
relative distances
spatial representation
unsupervised clustering based on web browsing history
have a position for each user
no implementation that works at scale!
find new segments
simrank
Simrank & MDS
website
website
website
website
22 million nodes
123 million edges
simrank
5+ millions
25+ trillions
Clustering
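To make the simrank step concrete, here is a minimal sketch of the standard SimRank recurrence on a toy graph. It is not the code used in the project: the dense-matrix formulation, the decay factor of 0.8, the iteration count, and the tiny user/website graph are all assumptions chosen for illustration.

```python
import numpy as np

def simrank(adj, decay=0.8, iterations=10):
    """Naive dense SimRank. adj[i, j] = 1 means there is an edge i -> j.

    s(a, a) = 1, and for a != b:
    s(a, b) = decay / (|I(a)| |I(b)|) * sum of s(i, j) over in-neighbours
    i of a and j of b. Nodes without in-neighbours get similarity 0.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    in_deg = adj.sum(axis=0)
    # Column-normalise: w[i, a] = 1 / |I(a)| if i -> a, else 0.
    w = np.divide(adj, in_deg, out=np.zeros_like(adj), where=in_deg > 0)
    s = np.eye(n)
    for _ in range(iterations):
        s = decay * (w.T @ s @ w)
        np.fill_diagonal(s, 1.0)  # every node is fully similar to itself
    return s

# Hypothetical toy graph: users 0, 1, 2 and websites 3, 4.
# Users 0 and 1 both visit website 3; user 2 visits website 4 only.
adj = np.zeros((5, 5))
for user, site in [(0, 3), (1, 3), (2, 4)]:
    adj[user, site] = adj[site, user] = 1.0  # treat visits as undirected

similarity = simrank(adj)
print(similarity[0, 1])  # users sharing their only website
print(similarity[0, 2])  # users with nothing in common
```

The output is one similarity per pair of nodes, which is where the scale problem comes from: for 5 million users a dense result has 5,000,000 × 5,000,000 = 25 trillion entries, which may be what the "25+ trillions" figure on the slide refers to.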
Simrank & MDS
MDS: scalable but too complex to do in time
website
website
website
website
22 million nodes
123 million edges
simrank
5+ millions
MDS
Clustering
(45, 36)
✓ Implemented
✖ Fail
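For reference, the idea behind the MDS step (turning relative distances into a position for each user, such as the (45, 36) above) can be sketched with classical MDS on a small distance matrix. This is an illustrative assumption, not the scalable variant the team attempted: it uses a full eigendecomposition, so it only works for small inputs, and the three sample points are hypothetical.

```python
import numpy as np

def classical_mds(dist, dims=2):
    """Classical (Torgerson) MDS: embed points so that Euclidean
    distances between rows approximate the given distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-centre the squared distances to get a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the largest eigenvalues; clip tiny negatives from rounding.
    top = np.argsort(eigvals)[::-1][:dims]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

# Hypothetical example: three points whose true layout is a 3-4-5 triangle.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(dist)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.round(recovered, 6))  # should match dist up to rounding
```

With a similarity measure such as simrank, one would first need to convert similarities into distances (for instance 1 - similarity, which is again an assumption). The positions are only defined up to rotation and reflection, which is enough for clustering.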
Lay the bare stuff down first, THEN refine
Cluster still a huge mess to deploy
Results
Singles
Locality-Sensitive Hashing
Hand-made code!
typical web browsing: pof.com, tagged.com
“The year of being single”, Marketing Magazine, 2013
“The rise of the single economy”, The Guardian, 2014
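The slides do not say which locality-sensitive hashing scheme the hand-made code used. As one possible illustration, the sketch below uses MinHash with banding over each user's set of visited sites, so that users with similar browsing histories land in the same bucket without comparing every pair. The hash function (CRC32 with a seed prefix), the number of hashes and bands, and the user names and example.org domains are all assumptions; only pof.com and tagged.com come from the slide.

```python
import zlib
from collections import defaultdict
from itertools import combinations

def minhash_signature(sites, num_hashes=32):
    """One minimum per seeded hash function over a non-empty set of sites."""
    return tuple(
        min(zlib.crc32(f"{seed}:{site}".encode()) for site in sites)
        for seed in range(num_hashes)
    )

def candidate_pairs(histories, num_hashes=32, bands=8):
    """Return pairs of users whose signatures agree on at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for user, sites in histories.items():
        signature = minhash_signature(sites, num_hashes)
        for band in range(bands):
            key = (band, signature[band * rows:(band + 1) * rows])
            buckets[key].add(user)
    pairs = set()
    for users in buckets.values():
        pairs.update(combinations(sorted(users), 2))
    return pairs

# Hypothetical browsing histories.
histories = {
    "user_a": {"pof.com", "tagged.com"},
    "user_b": {"pof.com", "tagged.com"},
    "user_c": {"news.example.org", "sport.example.org"},
}

pairs = candidate_pairs(histories)
print(("user_a", "user_b") in pairs)  # identical histories always collide
```

Each user is hashed once, so the cost grows with the number of users rather than with the number of user pairs, which is the property that makes this family of techniques attractive at the scale described earlier.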
Final results obtained on the last day
Essential: fuel & friends
- power & network fail
- Bare pipeline first
- Distributed is hard, let's go think instead!
- Fuel & friends