foursquare - ml presentation

3/22/2011 Machine Learning Meetup. Justin Moore @injust, Matthew Rathbone @rathboma. Big Data @ foursquare: Infrastructure, Analytics, Prediction, and Beyond

Upload: justin-moore

Post on 26-Mar-2015

TRANSCRIPT

Page 1: Foursquare - ML Presentation

3/22/2011 Machine Learning Meetup Justin Moore - @injust

Matthew Rathbone - @rathboma

Big Data @ foursquare

Infrastructure, Analytics, Prediction, and Beyond

Page 2: Foursquare - ML Presentation

Overview

• What is foursquare
• Analytics and Data
• Machine Learning, Recommendations

Page 3: Foursquare - ML Presentation

What is Foursquare?

• Location-based startup, application that helps you to explore your city, discover new places
• Visit places, check in, earn rewards, stay connected with your friends
• Game elements: single-player, multi-player

Page 4: Foursquare - ML Presentation

What is Foursquare? (cont.)

• 7M+ users, 15M+ venues, 500M+ check-ins
• Large reach (every country, North Pole, Space, Everest)
• Native app for almost every smartphone, also available on SMS, web, mobile-web

Page 5: Foursquare - ML Presentation

Explore

• Our new social-recommendation engine
• Real-time suggestions based on your social graph.

Page 6: Foursquare - ML Presentation

Data Model

• Users
• Venues
• Tips/To-dos
• Check-ins
• Shouts

Page 7: Foursquare - ML Presentation

Analytics @ Foursquare

I’m going to talk about:
• Why production DBs are bad for analytics
• What we do to make it better (hint: Hadoop)
• Our custom Dashboard
• Usage examples
• Thoughts about the Hadoop/Hive experience

Page 8: Foursquare - ML Presentation

Our Data: Problems using the Production Databases

Page 9: Foursquare - ML Presentation

Page 10: Foursquare - ML Presentation

Page 11: Foursquare - ML Presentation

Our data: So we turn to our friends

Our reporting / analytics / data mining stack exists thanks to open source software

Page 12: Foursquare - ML Presentation

Our data: What we do instead

Log Files

Page 13: Foursquare - ML Presentation

About Hadoop and Hive

Hadoop:
• Distributed data processing framework (map-reduce).
• Written in Java

Hive:
• SQL layer on top of Hadoop
• Lets us do “select count(1) from checkins” instead of having to write our own map-reduce Java classes.

Image from ibm.com

Page 14: Foursquare - ML Presentation

About Hive

• Create/Drop/Insert/Select etc.
• Table Joins
• Aggregation Functions
• Date Functions
• URL parsing functions
• Cool n-gram functions
• Just now getting database drivers for popular languages (Java)

Page 15: Foursquare - ML Presentation

About Hive

Select * from x;
Select count(1) from x;
Select sum(x.price) from x;
Select a, sum(price) from x group by a;
Select a from x where datediff('2011-01-01', d) = 0;
Drop table x;

Page 16: Foursquare - ML Presentation

Hadoop vs Hive

SELECT
  created_date,
  country,
  count(1)
FROM checkins
GROUP BY
  created_date,
  country

VS

# mapper: emit one "date,country" key per check-in line
$stdin.each do |line|
  date, country, _id = line.split
  puts date + "," + country
end

# reducer: count how often each key appears
counts = Hash.new(0)
$stdin.each do |line|
  counts[line.chomp] += 1
end
counts.each { |key, total| puts "#{key},#{total}" }
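The comparison above can be tried without a cluster. The following is a minimal sketch that runs the same map and reduce steps in one Ruby process; the method names and the sample lines are made up for illustration, and the input is assumed to be whitespace-separated "date country id" lines as in the mapper on the slide.

```ruby
# Mapper step: turn one "date country id" line into a "date,country" key.
def map_line(line)
  date, country, _id = line.split
  "#{date},#{country}"
end

# Reducer step: count how many times each key appears.
def reduce_keys(keys)
  counts = Hash.new(0)
  keys.each { |key| counts[key.chomp] += 1 }
  counts
end

# Made-up sample check-ins, for illustration only.
sample_lines = [
  "2011-03-01 US 1",
  "2011-03-01 CH 2",
  "2011-03-01 US 3",
  "2011-03-02 US 4"
]

# Hadoop sorts mapper output by key before the reducer reads it.
mapped = sample_lines.map { |line| map_line(line) }.sort

# One "date,country,total" row per group, like the GROUP BY query.
reduce_keys(mapped).each { |key, total| puts "#{key},#{total}" }
```

The Hive query expresses the same grouping in a single statement, which is the point the slide is making.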

Page 17: Foursquare - ML Presentation

Our Hadoop Infrastructure

• We use clusters generated through Amazon’s Elastic MapReduce
• That means we store all of our data in flat files in Amazon S3 (which keeps things simple)
• We export data from both MongoDB and HTTP proxy log files
• We manage everything using a custom Ruby on Rails dashboard

“rake cluster:start[30]” => starts a 30-node cluster, just like that
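For readers unfamiliar with the bracket syntax, here is a rough sketch of how a Rake task can accept the node count as an argument. The task body is a placeholder: the talk does not show what the real task does, so `cluster_request` and its return value are hypothetical and nothing here talks to Amazon.

```ruby
require "rake"

# Hypothetical stand-in for the real work: validate the argument and
# describe the request instead of calling any cloud API.
def cluster_request(nodes)
  count = Integer(nodes)
  raise ArgumentError, "node count must be positive" unless count.positive?
  { action: "start", nodes: count }
end

namespace :cluster do
  desc "Start a cluster with the given number of nodes"
  task :start, [:nodes] do |_task, args|
    request = cluster_request(args[:nodes])
    puts "Would start a #{request[:nodes]}-node cluster"
  end
end

# Same effect as running `rake cluster:start[30]` from a shell.
Rake::Task["cluster:start"].invoke("30")
```

Rake passes task arguments as strings, which is why the sketch converts the value with `Integer` before using it.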

Page 18: Foursquare - ML Presentation

Our Dashboard

• Define and schedule reports through it
• Allow ad-hoc access to (internal) users
• Controls data imports into S3 from mongo/log-files
• Provides an intermediate DB layer for rollup data caching (experimental atm)
• Allows you to do a bunch of cool stuff with queries after they’ve run

Page 19: Foursquare - ML Presentation

Example: Importing Data

Page 20: Foursquare - ML Presentation

Example: Query Walkthrough

Find top 20 venues in Switzerland

venuename                        city             total
zurich airport (zrh)             kloten           3746
geneva-cointrin airport (gva)    grand-saconnex   3012
zurich hauptbahnhof              zurich           1780
sony ericsson football hotspot   basel            773
basel bahnhof sbb                basel            761
gare de cornavin                 geneva           760
bern hauptbahnhof                bern             736
gare de lausanne                 lausanne         672
apple store                      zurich           670
bahnhof luzern                   luzern           477
terminal e                       kloten           458
bellevueplatz                    zurich           457
terminal a                       kloten           455
bahnhof oerlikon                 zurich           453
bahnhof stadelhofen              zurich           444
sihlcity                         zurich           400
zurich flughafen bahnhof         zurich           400
bahnhof olten                    olten            391
bahnhof winterthur               winterthur       379
bahnhof hardbrücke               zurich           369
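The slides do not show the query behind this table. As a hedged illustration of the computation it implies (filter by country, count check-ins per venue, sort descending, keep the top N), here is a small in-memory Ruby sketch. The row layout, the `country` field, and the sample data are assumptions for illustration, not foursquare's schema.

```ruby
# Count check-ins per (venue, city) within one country and return the
# `limit` most-visited venues, highest total first.
def top_venues(checkins, country, limit)
  counts = Hash.new(0)
  checkins.each do |row|
    next unless row[:country] == country
    counts[[row[:venue], row[:city]]] += 1
  end

  counts
    .map { |(venue, city), total| { venuename: venue, city: city, total: total } }
    .sort_by { |row| [-row[:total], row[:venuename]] } # ties broken by name
    .first(limit)
end

# Made-up check-in rows, for illustration only.
sample = [
  { venue: "zurich hauptbahnhof", city: "zurich", country: "CH" },
  { venue: "apple store",         city: "zurich", country: "CH" },
  { venue: "zurich hauptbahnhof", city: "zurich", country: "CH" },
  { venue: "some diner",          city: "new york", country: "US" }
]

top_venues(sample, "CH", 20).each do |row|
  puts [row[:venuename], row[:city], row[:total]].join("\t")
end
```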

Page 21: Foursquare - ML Presentation

Walkthrough: Start the query

Page 22: Foursquare - ML Presentation

Walkthrough: Get the results in email

Page 23: Foursquare - ML Presentation

Walkthrough: Top Venues

Page 24: Foursquare - ML Presentation

Walkthrough

If we want to schedule something to run daily/weekly/monthly, we can do that too.

Reports are represented as ActiveRecord models.

Page 25: Foursquare - ML Presentation

Walkthrough: Reports feed our dashboards

Page 26: Foursquare - ML Presentation

Walkthrough: queries allow data exploration

Page 27: Foursquare - ML Presentation

Stats on the Stats Stack

• 25-machine clusters
• Reports on check-in data (joining venues and/or users) usually take 5-15 minutes to run
• Reports on log data usually take 10-20 minutes to run
• We run 10-30 reports a day
• Most data goes into a Google spreadsheet for people to look at.

Page 28: Foursquare - ML Presentation

Thoughts on Amazon’s EMR

• The API has very low rate limits
• Everything is an HTTP GET request (even creating a cluster)
• The Ruby library/client is unusable as a client library (we shell out to it in order to capture the resulting JSON)
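The "shell out and capture the JSON" workaround from the last bullet can be sketched roughly as follows. The command used here is a stand-in Ruby one-liner that prints made-up JSON, not the EMR client; the method name and the JSON keys are hypothetical.

```ruby
require "json"
require "open3"
require "rbconfig"

# Run an external command, fail loudly on a non-zero exit status,
# and parse whatever it printed to stdout as JSON.
def run_and_parse_json(*command)
  stdout, stderr, status = Open3.capture3(*command)
  raise "command failed: #{stderr.strip}" unless status.success?
  JSON.parse(stdout)
end

# Stand-in for a real CLI call: a child Ruby process that prints JSON.
fake_cli = [RbConfig.ruby, "-e", 'puts %q({"state": "STARTING", "nodes": 30})']

result = run_and_parse_json(*fake_cli)
puts "state=#{result["state"]} nodes=#{result["nodes"]}"
```

Passing the command as separate arguments avoids going through a shell, so no quoting of the arguments is needed.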

Page 29: Foursquare - ML Presentation

Thoughts on Hive

• Generally good
• Sometimes it will act crazy
• Partitioning data is harder than it looks
• The JSON SerDe makes all sorts of weird stuff happen when you’re joining tables
• Always join LAST!

Page 30: Foursquare - ML Presentation

Working With Hive

OK:

SELECT
  v.venuename,
  count(*)
FROM checkins c
JOIN venues v
  ON c.venueid = v.id
GROUP BY v.venuename

BETTER:

SELECT
  v.venuename,
  c.total
FROM (
  SELECT
    venueid,
    count(1) AS total
  FROM checkins
  GROUP BY venueid
) c
JOIN venues v
  ON c.venueid = v.id
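To make the "join last" advice concrete, here is a small in-memory Ruby sketch of the two query shapes. It is an illustration under made-up data, not a model of Hive's execution: it only shows that aggregating first yields the same totals while the join touches one row per venue instead of one row per check-in.

```ruby
# "OK" shape: join every check-in to its venue, then group and count.
def join_then_aggregate(checkins, venues)
  by_id = venues.to_h { |v| [v[:id], v] }
  joined = checkins.map { |c| by_id[c[:venueid]] }.compact
  totals = Hash.new(0)
  joined.each { |venue| totals[venue[:venuename]] += 1 }
  { totals: totals, joined_rows: joined.size }
end

# "BETTER" shape: count per venueid first, then join the small result.
def aggregate_then_join(checkins, venues)
  per_venue = Hash.new(0)
  checkins.each { |c| per_venue[c[:venueid]] += 1 }

  by_id = venues.to_h { |v| [v[:id], v] }
  totals = Hash.new(0)
  joined_rows = 0
  per_venue.each do |venue_id, total|
    venue = by_id[venue_id]
    next unless venue
    totals[venue[:venuename]] += total
    joined_rows += 1
  end
  { totals: totals, joined_rows: joined_rows }
end

# Made-up tables, for illustration only.
venues = [
  { id: 1, venuename: "zurich hauptbahnhof" },
  { id: 2, venuename: "apple store" }
]
checkins = [1, 1, 2, 1, 2].map { |id| { venueid: id } }

ok = join_then_aggregate(checkins, venues)
better = aggregate_then_join(checkins, venues)
puts "same totals: #{ok[:totals] == better[:totals]}"
puts "rows joined: #{ok[:joined_rows]} vs #{better[:joined_rows]}"
```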

Page 31: Foursquare - ML Presentation

Our Data: End

• Hadoop + Hive > Mongo + Scripts
• Simple Ruby dashboard == super useful
• Lots of data == fun charts

QUESTIONS?

Page 32: Foursquare - ML Presentation

foursquare 3.0: Explore

Page 33: Foursquare - ML Presentation

Engineering an Online Recommendation System

Page 34: Foursquare - ML Presentation

Engineering cont.

Goals:
• “Here and now”
• No new signals
• Use all of our textual data
• 100ms per query

Page 35: Foursquare - ML Presentation

Engineering cont.

Pain points:
• Geo indexes, compound geo indexes
• Limiting queries in minimally impactful ways
• Cached datastores (building rollup collections)
• Geo indexes

Page 36: Foursquare - ML Presentation

Computing a Similarity Matrix

• Analyzing similarity functions OK on single machine
• 10M+ venues = 100 trillion element sparse matrix
  – Compute without visiting every element
  – Parallelize, cross machine
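A quick check of the size quoted above, taking 10M as a round lower bound on the venue count N: a full venue-by-venue matrix has N squared cells.

```latex
N = 10^{7}
\quad\Longrightarrow\quad
N \times N = 10^{7} \times 10^{7} = 10^{14} = 100 \text{ trillion entries}
```

Almost all of those cells are zero, since most pairs of venues share no visitors, which is why the matrix is stored sparsely and computed without visiting every element.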

Page 37: Foursquare - ML Presentation

Compute Similarity Matrix, cont.

• Leverage Mahout’s library of similarity functions, easy to extend
• Job system controls execution of sequential dependent M-R tasks
• Hadoop: easily scalable to large commodity machine clusters, elastic makes increasing cluster size trivial

Page 38: Foursquare - ML Presentation

Compute Similarity Matrix, cont.

Series of “Jobs,” each doing a Map-Reduce:
1. Convert input flat file dumped from Hive to binary sparse vector representation
2. Compute pairwise co-occurrences
3. Compute column-based weights (column normalization), retrieve all vectors with co-occurrences
4. Compute pairwise similarities, store in sparse matrix format
5. Flatten sparse matrix to text format that we can load into DB
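The five steps above can be sketched on a single machine as plain Ruby. This is only an illustration under stated assumptions: the input is taken to be tab-separated "user, venue" lines, the similarity is cosine over binary visit vectors (the talk says Mahout's similarity functions are used but does not name one), and every method name is made up. The real pipeline runs each step as a Map-Reduce job.

```ruby
require "set"

# Step 1: parse flat "user<TAB>venue" lines into sparse visit sets.
def parse_visits(lines)
  visits = Hash.new { |hash, user| hash[user] = Set.new }
  lines.each do |line|
    user, venue = line.chomp.split("\t")
    visits[user] << venue
  end
  visits
end

# Step 2: count co-occurrences only for venue pairs that share a user,
# so the empty cells of the matrix are never visited.
def cooccurrences(visits)
  counts = Hash.new(0)
  visits.each_value do |venues|
    venues.to_a.sort.combination(2) { |a, b| counts[[a, b]] += 1 }
  end
  counts
end

# Step 3: per-venue weight (the vector norm of a binary visit column).
def venue_norms(visits)
  users_per_venue = Hash.new(0)
  visits.each_value { |venues| venues.each { |v| users_per_venue[v] += 1 } }
  users_per_venue.transform_values { |n| Math.sqrt(n) }
end

# Step 4: cosine similarity for each co-occurring pair (an assumption).
def similarities(counts, norms)
  counts.to_h { |(a, b), shared| [[a, b], shared / (norms[a] * norms[b])] }
end

# Step 5: flatten the sparse result to text rows ready for loading.
def to_text_rows(sims)
  sims.map { |(a, b), score| format("%s\t%s\t%.4f", a, b, score) }
end

# Made-up input, for illustration only.
lines = ["u1\tv1", "u1\tv2", "u2\tv1", "u2\tv2", "u3\tv1"]
visits = parse_visits(lines)
rows = to_text_rows(similarities(cooccurrences(visits), venue_norms(visits)))
puts rows
```

Iterating per user rather than per venue pair is what keeps step 2 proportional to the observed co-visits instead of the full matrix size.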

Page 39: Foursquare - ML Presentation

The Value of Why

• Show people which friends visited, which places are co-visited (not the same as “similar”?)
• Lowers the bar for precision
  – Allows users to choose for themselves among recs
  – Increase propensity to check in (sales pitch for the venue)
• Mix with the social, story-telling aspects of product
• Collaborative filtering allows for easy description

Page 40: Foursquare - ML Presentation

Case Study: Defining “Interesting”

• Need to show ranked venues for “cold-start”
• Various influencing factors in what makes a place “interesting”
  – Number of users checked in
  – Average visits per user
  – Tips left, to-dos done
  – How people check in (broadcast to T/FB, off-the-grid?)
  – Trending direction (more popular lately?)
• Measuring raw popularity poses problems
  – Places open just for lunch, smaller dining rooms, longer meal times
  – Been in system longer, opened recently
  – Differences between categories (coffee shops != burger joints)
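Two of the factors listed above, the number of users checked in and the average visits per user, can be computed directly from check-in records. The sketch below shows one way to do that in Ruby; the record layout and sample data are assumptions, and it deliberately stops short of combining the factors into a score, since the slides do not give a formula.

```ruby
# For each venue, compute how many distinct users checked in and how
# many visits each of those users made on average.
def venue_metrics(checkins)
  visits = Hash.new { |hash, venue| hash[venue] = Hash.new(0) }
  checkins.each { |c| visits[c[:venue]][c[:user]] += 1 }

  visits.transform_values do |per_user|
    unique = per_user.size
    total = per_user.values.sum
    { unique_users: unique, visits_per_user: total.to_f / unique }
  end
end

# Made-up check-ins, for illustration only.
sample = [
  { venue: "corner cafe", user: "u1" },
  { venue: "corner cafe", user: "u1" },
  { venue: "corner cafe", user: "u1" },
  { venue: "corner cafe", user: "u2" },
  { venue: "landmark",    user: "u1" },
  { venue: "landmark",    user: "u2" },
  { venue: "landmark",    user: "u3" }
]

venue_metrics(sample).each do |venue, m|
  puts format("%s\tusers=%d\tvisits/user=%.2f", venue, m[:unique_users], m[:visits_per_user])
end
```

In this toy data the cafe has few users who return often while the landmark has more users who each visit once, which is the kind of contrast the two metrics are meant to separate.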

Page 41: Foursquare - ML Presentation

Defining “Interesting” cont.

[Chart: Visits Per User plotted against Unique Users, with regions labeled “Local Favorite” and “Must See”]

Page 42: Foursquare - ML Presentation

Future Directions

• Still a big unknown, collect user feedback to drive development
• Scale beyond just co-occurrences, improve prediction in new territory
• Planning mode (beyond the here and now)
• Joint recommendations (where do I go with this set of friends?)

Page 43: Foursquare - ML Presentation

Help us get there

foursquare is hiring: www.foursquare.com/jobs

Justin Moore

@injust [email protected]

Matthew Rathbone @rathboma

ma;[email protected]