hadoop @ foursquare

Hadoop @ Foursquare Joe Crobak Software Engineer @joecrobak Blake Shaw, PhD Data Scientist @metablake

Upload: foursquarehq

Post on 24-Oct-2014


TRANSCRIPT

Page 1: Hadoop @ Foursquare

Hadoop @ Foursquare

Joe Crobak, Software Engineer, @joecrobak

Blake Shaw, PhD, Data Scientist, @metablake

Page 2: Hadoop @ Foursquare
Page 3: Hadoop @ Foursquare

An app that helps you explore your city and connect with friends

A platform for location based services and data

What is Foursquare?

Page 4: Hadoop @ Foursquare

People use foursquare to:

– share with friends
– discover new places
– get tips
– get deals
– earn points and badges
– keep track of visits

What is Foursquare?

Page 5: Hadoop @ Foursquare

Mobile Social

Local

What is Foursquare?

Page 6: Hadoop @ Foursquare

20,000,000+ people
30,000,000+ places
2,000,000,000+ check-ins
1500+ actions/second

Stats

Page 8: Hadoop @ Foursquare

• Intro to Foursquare Data
• Mining Signals from Check-ins
• Data Pipeline
• Data Infrastructure
  • Past
  • Present
  • Future

Overview

Page 9: Hadoop @ Foursquare

Explore

A social recommendation engine built from check-in data

Page 10: Hadoop @ Foursquare

Central  Park JFK

What is a place?

Page 11: Hadoop @ Foursquare

Time signatures for places

Page 12: Hadoop @ Foursquare

[Figure: two plots by month in 2011 (New York): temperature (°F), and % of check-ins at ice cream shops]

Ice cream?

Page 13: Hadoop @ Foursquare

Warm weather spots
– ice cream shops
– roof decks
– boats or ferries
– harbors or marinas
– sculpture gardens
– tracks
– basketball courts
– parks

Cold weather spots
– lakes
– basketball stadiums
– hockey stadiums
– art galleries
– skating rinks
– boutiques
– steakhouses
– ramen or noodle house

Check-ins and the weather

Page 14: Hadoop @ Foursquare
Page 15: Hadoop @ Foursquare

• Critical for our recommendation engine
• Large sparse k-nearest neighbor problem
  – Items can be places, people, brands
  – Different distance metrics
  – Need to exploit sparsity, otherwise intractable

Finding similar items

Page 16: Hadoop @ Foursquare

• Metrics we find work best for recommending:

– Places: cosine similarity: $\mathrm{sim}(x_i, x_j) = \dfrac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$

– Friends: intersection: $\mathrm{sim}(A, B) = |A \cap B|$

– Brands: Jaccard similarity: $\mathrm{sim}(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$
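The three metrics on this slide can be sketched in a few lines of Python. This is only an illustration of the formulas, not Foursquare's code: the function names are made up here, vectors are assumed to be sparse dicts mapping a key (for example a user id) to a weight, and sets stand in for friend lists or brand audiences.

```python
from math import sqrt


def cosine_similarity(x, y):
    """Cosine similarity of two sparse vectors stored as {key: weight} dicts."""
    if len(x) > len(y):
        x, y = y, x  # iterate over the shorter vector to exploit sparsity
    dot = sum(w * y[k] for k, w in x.items() if k in y)
    norm = sqrt(sum(w * w for w in x.values())) * sqrt(sum(w * w for w in y.values()))
    return dot / norm if norm else 0.0


def intersection_similarity(a, b):
    """Size of the overlap between two sets."""
    return len(a & b)


def jaccard_similarity(a, b):
    """Overlap divided by the size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Only keys present in both vectors contribute to the dot product, which is why sparse storage keeps the per-pair cost small.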

Finding similar items

Page 17: Hadoop @ Foursquare

$X \in \mathbb{R}^{n \times d}$

each entry is the log(# of checkins at place i by user j)

one row for every 30m venues...

$K \in \mathbb{R}^{n \times n}$

$K_{ij} = \mathrm{sim}(x_i, x_j) = \dfrac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$

Computing venue similarity

Page 18: Hadoop @ Foursquare

$K \in \mathbb{R}^{n \times n}$

$K_{ij} = \mathrm{sim}(x_i, x_j) = \dfrac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$

• Naive solution for computing $K$: $O(n^2 d)$

• Requires ~4.5m machines to compute in < 24 hours!!! and 3.6PB to store!
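The storage figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes n = 30 million venues (from the slides), one 4-byte float per similarity score (an assumption, since the slide does not state the entry size), and d = 20 million users (the "20,000,000+ people" stat). It does not attempt to reproduce the machine-count estimate, which depends on per-machine throughput that the slides do not give.

```python
N_VENUES = 30_000_000   # "30m venues" from the slides
N_USERS = 20_000_000    # "20,000,000+ people" from the stats slide
BYTES_PER_ENTRY = 4     # assumption: one 32-bit float per similarity score


def dense_matrix_petabytes(n, bytes_per_entry=BYTES_PER_ENTRY):
    """Storage for a dense n x n similarity matrix, in petabytes (1 PB = 1e15 bytes)."""
    return n * n * bytes_per_entry / 1e15


def naive_multiply_adds(n, d):
    """Multiply-add count for the naive O(n^2 d) computation of all dot products."""
    return n * n * d


print(dense_matrix_petabytes(N_VENUES))          # petabytes for the dense matrix
print(naive_multiply_adds(N_VENUES, N_USERS))    # operations for the naive approach
```

Under these assumptions the dense matrix comes to 3.6 PB, which matches the number on the slide.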

Computing venue similarity

Page 19: Hadoop @ Foursquare

Venue similarity w/ map reduce

[Diagram: MapReduce flow for venue similarity]

map (input: user, visited venues): emit “all” pairs of visited venues for each user. Key: (vi, vj); value: score.

reduce (input: key (vi, vj), scores): sum up each user’s score contribution to this pair of venues to get the final score.
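The map and reduce steps on this slide can be simulated in memory. The sketch below is an illustration under assumptions, not the production job: the input is assumed to be a dict of user id to {venue: weight} with positive weights already computed, the shuffle is a plain dict, and the final division by the venue norms is added here to turn the summed scores into cosine similarity (the slide itself only shows the sum). All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def map_user(venue_weights):
    """Map step: emit ((vi, vj), score) for every pair of venues one user visited."""
    for (vi, wi), (vj, wj) in combinations(sorted(venue_weights.items()), 2):
        yield (vi, vj), wi * wj  # this user's contribution to the dot product


def reduce_pair(scores):
    """Reduce step: sum every user's contribution for one venue pair."""
    return sum(scores)


def venue_similarity(user_visits):
    """Run map, shuffle, and reduce in memory, then normalize to cosine similarity."""
    grouped = defaultdict(list)     # stands in for the shuffle: key -> list of scores
    norms_sq = defaultdict(float)   # squared norm of each venue's row
    for venue_weights in user_visits.values():
        for key, score in map_user(venue_weights):
            grouped[key].append(score)
        for venue, weight in venue_weights.items():
            norms_sq[venue] += weight * weight
    return {
        (vi, vj): reduce_pair(scores) / sqrt(norms_sq[vi] * norms_sq[vj])
        for (vi, vj), scores in grouped.items()
    }
```

Pairs of venues that no user has in common are never emitted, so the work scales with co-visits rather than with all n² pairs.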

Page 20: Hadoop @ Foursquare

• Intro to Foursquare Data
• Mining Signals from Check-ins
• Data Pipeline
• Data Infrastructure
  • Past
  • Present
  • Future

Overview

Page 21: Hadoop @ Foursquare

Data pipeline - stats

1,500,000,000+ log events / week

2,500,000,000,000+ bytes / week

[Chart: GB / day (compressed) for api collection, May 2011 to May 2012]

Page 22: Hadoop @ Foursquare

Data pipeline - stats

And lots of people are using it!
– 100+ hive users.
– several users with > 100 jobs.
– ~ 700 MR jobs / day.

Page 23: Hadoop @ Foursquare

Data pipeline - stats

Page 24: Hadoop @ Foursquare

Data pipeline - background

Foursquare’s technology stack
– Amazon EC2
– MongoDB
– Solr / elasticsearch
– Scala
  • Lift web framework
– Flume (0.9.x aka old-gen)
– Amazon S3

Page 25: Hadoop @ Foursquare

Data pipeline - overview

[Diagram: API / WWW → Flume → Collector (JSON) → S3 (.../collection-name/dt=2012-06-19/...) → Hive / MapReduce; mongodb → Export Process → S3]

Page 26: Hadoop @ Foursquare

Data pipeline - logs

Applications log JSON
– some common fields (e.g. event id, timestamp, host)
– data is partitioned by collection and date in S3.
– one table per collection in Hive.

Flume for data transport.

[Diagram: API / WWW → Flume → Collector (JSON) → S3 (.../collection-name/dt=2012-06-19/...)]
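As a rough illustration of partitioning by collection and date, the sketch below derives a `collection-name/dt=YYYY-MM-DD/` prefix for one JSON log event. The field name `timestamp` (epoch seconds, UTC), the `root` prefix, and the function name are all assumptions made for this example; the slides elide the real path prefix and do not show the event schema.

```python
import json
from datetime import datetime, timezone


def partition_prefix(event_json, collection, root="logs"):
    """Return the date-partitioned prefix for one JSON log event.

    Assumes the event carries a 'timestamp' field in epoch seconds (UTC).
    'root' is a placeholder; the real prefix is not given in the slides.
    """
    event = json.loads(event_json)
    day = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc).strftime("%Y-%m-%d")
    return f"{root}/{collection}/dt={day}/"


event = json.dumps({"event_id": "e1", "timestamp": 1340067600, "host": "web-1"})
print(partition_prefix(event, "collection-name"))
```

A `dt=` directory per day lines up with Hive's partition naming, so each collection can be exposed as one table partitioned by date.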

Page 27: Hadoop @ Foursquare

Data pipeline - mongodb

Mongo data is nice to work with in MapReduce
– info in logs can be stale.
– certain attributes not in logs.
– can scan much less data.

MapReduce against production mongo cluster would degrade performance and/or cause denial-of-service.

[Diagram: mongodb → Export Process → S3]

Page 28: Hadoop @ Foursquare

Data pipeline - analytics

Automated reporting
– typically a Hive query -> google docs spreadsheet.

Ad hoc reporting
– hive dashboard for entering query and receiving an email when results are ready.
– RoR

Page 29: Hadoop @ Foursquare

Data pipeline - beekeeper

Page 30: Hadoop @ Foursquare

Data pipeline - Summary

Log data and snapshots of mongo data are stored in S3.

Users query/analyze the data using Hive, Pig, and MapReduce.

Compiled data is inserted into mongo or google spreadsheets for reporting.

Page 31: Hadoop @ Foursquare

• Intro to Foursquare Data
• Mining Signals from Check-ins
• Data Pipeline
• Data Infrastructure
  • Past
  • Present
  • Future

Overview

Page 32: Hadoop @ Foursquare

Data infrastructure - past

Elastic MapReduce
– but we were keeping clusters continuously running.

Rudimentary workflow management
– start daily reporting at time X. Hope that data is there.
– difficult to monitor.

Page 33: Hadoop @ Foursquare

Data infrastructure - past

Scaling the number of users was troublesome
– most of the company uses hive.
– hive server continuously crashed.
– lots of memory issues.
– resource contention.

Mongo data
– converted to delimited records, which doesn’t always make sense.
– incremental dumps - some data not consistent (e.g. if two venues are merged).
– basic schema detection.
– single threaded per-collection.

Page 34: Hadoop @ Foursquare

Data infrastructure - past

Hive and EMR “flows” supported for automated reporting.

lots of mapreduce tools written in ruby
– everything else is scala
  • want to use common utilities
– installing gems on system is brittle

Page 35: Hadoop @ Foursquare

• Intro to Foursquare Data
• Mining Signals from Check-ins
• Data Pipeline
• Data Infrastructure
  • Past
  • Present
  • Future

Overview

Page 36: Hadoop @ Foursquare

Data infrastructure - present

Introduced a lot of new systems

– Cloudera’s Hadoop Distro - CDH3u3
– Oozie for workflow / data management
– Pig for reporting
– Scaled back ruby / hive dashboard
– BSON mongo dumps
– Scala MapReduce
– Scoobi

Page 37: Hadoop @ Foursquare

Data infrastructure - CDH3u3

12-node cluster in EC2 on cc8.xlarge instances
– data is in S3
– fair scheduler (jobs run as submitting user)
– performance improvements
  • skew means slowest reducer often defines wall-clock time
– signs of virtualization
– cpu bound (data compression)

Page 38: Hadoop @ Foursquare

Data infrastructure - oozie

Pros
– better monitoring (though not perfect).
– coordinators for dataset management are great.
– oozie distributes job submission via map tasks.
– SPoF but recovers after a restart (state stored in DB).

Cons
– deployment is not ideal, it’s difficult to version workflows.
– configuration via XML - lots of boilerplate
  • we have a scaffolding script to bootstrap a workflow.

Page 39: Hadoop @ Foursquare

Oozie coordinator (the good)

[Diagram: a Coordinator for Workflow F depends on Dataset A. It checks S3 / HDFS: does the dataset instance exist yet? Yes? Kickoff workflow!]
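The coordinator behaviour in this diagram amounts to "poll until the dataset instance exists, then start the workflow". The Python sketch below illustrates that control flow only; it is not how Oozie is implemented or configured (Oozie coordinators are declared in XML). The function name and the `exists` and `kickoff` callables are hypothetical stand-ins for an S3 / HDFS existence check and a workflow submission.

```python
import time


def run_when_available(instance, exists, kickoff,
                       poll_seconds=60.0, max_checks=None, sleep=time.sleep):
    """Poll for a dataset instance and start the workflow once it exists.

    exists(instance) -> bool and kickoff(instance) are injected callables.
    Returns True if the workflow was started, False if max_checks ran out.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if exists(instance):      # "Does data exist yet?"
            kickoff(instance)     # "Yes? Kickoff workflow!"
            return True
        sleep(poll_seconds)       # not there yet: wait and check again
    return False
```

Compared with starting a report at a fixed time and hoping the data is there, the workflow only runs once its input has actually landed.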

Page 40: Hadoop @ Foursquare

Oozie XML (the bad)

Hello World in Oozie
– just invoking HelloWorld#main

Page 41: Hadoop @ Foursquare

Pig for reporting

Converted some ruby streaming to Pig + Scala UDFs.

More natural than Hive for some reports, especially those that output to multiple locations.

Elephant-Bird (twitter), Piggybank (apache), Data-fu (LinkedIn) all great UDF resources.

Page 42: Hadoop @ Foursquare

Ad hoc reporting dashboard

Uses hive thrift server to validate syntax (via EXPLAIN)

Submits jobs as Oozie workflows.

– The query is a parameter to the workflow.

– queries run as the users that submit them.

[Diagram: Beekeeper receives $QUERY, then:]

1. Thrift Server: EXPLAIN $QUERY
2. Oozie REST: submit workflow, query=$QUERY
3. Oozie REST: (repeatedly) is workflow done?
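The three numbered steps can be read as a validate, submit, poll loop. The sketch below shows that flow with injected callables; `explain`, `submit`, and `status` are hypothetical placeholders for the Hive Thrift call and the Oozie REST calls, whose real client code is not shown in the slides. The terminal state names follow Oozie's workflow job states, but treat them as an assumption of this example.

```python
import time

# Assumed terminal workflow states for this sketch.
TERMINAL_STATES = {"SUCCEEDED", "KILLED", "FAILED"}


def run_adhoc_query(query, explain, submit, status,
                    poll_seconds=5.0, sleep=time.sleep):
    """Validate a query, submit it as a workflow, and poll until it finishes.

    explain(query) should raise if the query does not parse,
    submit(query) returns a job id, and status(job_id) returns a state string.
    """
    explain(query)              # 1. EXPLAIN $QUERY (syntax check only)
    job_id = submit(query)      # 2. submit workflow, query=$QUERY
    while True:                 # 3. (repeatedly) is workflow done?
        state = status(job_id)
        if state in TERMINAL_STATES:
            return job_id, state
        sleep(poll_seconds)
```

Running EXPLAIN first means a malformed query fails immediately in the dashboard instead of after a workflow has been scheduled.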

Page 43: Hadoop @ Foursquare


Hive dashboard - error


Page 44: Hadoop @ Foursquare

BSON data dumps

Full loads each day, parallelized.

mongodb’s native format is BSON.
– Binary JSON
– some extensions to JSON
– schema-less
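To make "Binary JSON" concrete, here is a deliberately tiny encoder covering just two BSON element types (UTF-8 strings and 32-bit integers). It follows the public BSON layout: a little-endian int32 total length, then typed elements, then a trailing zero byte. The function name is made up for this sketch, and a real pipeline would use an existing BSON library rather than hand-rolled code.

```python
import struct


def encode_bson(doc):
    """Encode a flat dict of str and int32 values as a BSON document."""
    body = b""
    for name, value in doc.items():
        key = name.encode("utf-8") + b"\x00"           # element name as a C string
        if isinstance(value, bool):
            raise TypeError("booleans are not handled in this sketch")
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # type 0x02: int32 byte length (including the trailing NUL), then bytes
            body += b"\x02" + key + struct.pack("<i", len(data)) + data
        elif isinstance(value, int):
            # type 0x10: 32-bit little-endian integer
            body += b"\x10" + key + struct.pack("<i", value)
        else:
            raise TypeError(f"unsupported type: {type(value).__name__}")
    # total length counts the 4 length bytes and the terminating 0x00
    return struct.pack("<i", len(body) + 5) + body + b"\x00"


print(encode_bson({"hello": "world"}))
```

Because every document starts with its own length, a reader can skip from one document to the next without parsing the fields.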

Page 45: Hadoop @ Foursquare

BSON data infrastructure

[Diagram: BSON data flow, from mongo to MapReduce]

BSON / EBS Snapshots: Mongo stores BSON on EBS.
– Periodic EBS Snapshots

BSON Split / LZO compress: Oozie Workflow to mount snapshot, split data, compress, upload to S3.

BSONObject: BSON InputFormat converts to BSONObject.

Thrift (scala codegen): Scala Codegen converts to Thrift Object.

ThriftBsonInputFormat: InputFormat for Thrift objects to use in MR.

Recordv2 SerDe / Scoobi Input: Hive SerDe and Scoobi Inputs.
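One way to picture the "split data" step: a BSON dump is a sequence of documents, each prefixed with its own little-endian int32 length, so it can be cut at document boundaries without decoding any fields. The sketch below works on an in-memory byte string and is only an assumption about how such a splitter could look; the function names and the `docs_per_split` parameter are hypothetical, and the actual workflow (including LZO compression) is not shown in the slides.

```python
import struct


def iter_bson_documents(data):
    """Yield each raw BSON document from a byte string of concatenated documents."""
    offset = 0
    while offset < len(data):
        if offset + 4 > len(data):
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack_from("<i", data, offset)
        if length < 5 or offset + length > len(data):
            raise ValueError("truncated or corrupt document")
        yield data[offset:offset + length]
        offset += length


def split_documents(data, docs_per_split):
    """Group whole documents into splits of at most docs_per_split documents."""
    current = []
    for doc in iter_bson_documents(data):
        current.append(doc)
        if len(current) == docs_per_split:
            yield b"".join(current)
            current = []
    if current:
        yield b"".join(current)
```

Each split contains only complete documents, so splits can be compressed and processed independently by separate map tasks.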

Page 46: Hadoop @ Foursquare

Scooby

Not  that  Scooby!

Page 47: Hadoop @ Foursquare

Scoobi

A strongly-typed data flow language written in Scala.

Much easier than writing MapReduce, but still very flexible.

https://github.com/nicta/scoobi

Page 48: Hadoop @ Foursquare

Scoobi Example

Counting checkins
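The code on this slide did not survive extraction. As a stand-in, the plain-Python sketch below shows the same kind of data flow (map each record to a key and a 1, group by key, combine by summing). It is not Scoobi code and does not use the Scoobi API; the tab-separated `user_id<TAB>venue_id` input format and the function name are assumptions for illustration.

```python
from collections import defaultdict


def count_checkins(lines):
    """Count checkins per venue from 'user_id<TAB>venue_id' lines."""
    # map: one (venue_id, 1) pair per non-empty input line
    pairs = ((line.rstrip("\n").split("\t")[1], 1) for line in lines if line.strip())

    # group by key: collect the 1s emitted for each venue
    grouped = defaultdict(list)
    for venue_id, one in pairs:
        grouped[venue_id].append(one)

    # combine: sum the grouped values
    return {venue_id: sum(ones) for venue_id, ones in grouped.items()}


print(count_checkins(["u1\tv1", "u2\tv1", "u1\tv2"]))
```

A typed data flow library expresses these same three stages as operations on a distributed collection and compiles them down to MapReduce jobs.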

Page 49: Hadoop @ Foursquare

Data infrastructure - Data Joins

Joins in MapReduce are cumbersome.
– Do them once!

Page 50: Hadoop @ Foursquare

Data infrastructure - Data Joins

[Diagram: Checkins, Tips, and Likes are combined in a single Data Join step, keyed by Venue, so that each venue record carries its checkins, tips, and likes]
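The idea of doing the join once can be sketched as a reduce-side join: tag every record with its source, group by venue, and write out one combined record per venue that later jobs read directly. The record layout (dicts with a `venue_id` field) and the function name below are assumptions for illustration, not the actual job.

```python
from collections import defaultdict


def join_by_venue(checkins, tips, likes):
    """Group checkins, tips, and likes under their venue id in one pass."""
    joined = defaultdict(lambda: {"checkins": [], "tips": [], "likes": []})
    sources = (("checkins", checkins), ("tips", tips), ("likes", likes))
    for source, records in sources:
        for record in records:
            # the venue id is the join key; the source name acts as the tag
            joined[record["venue_id"]][source].append(record)
    return dict(joined)


result = join_by_venue(
    checkins=[{"venue_id": "v1", "user": "u1"}],
    tips=[{"venue_id": "v1", "text": "try the ramen"}],
    likes=[{"venue_id": "v2", "user": "u2"}],
)
print(sorted(result))
```

Downstream reports can then scan the pre-joined venue records instead of repeating the same three-way join in every job.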

Page 51: Hadoop @ Foursquare

• Intro to Foursquare Data
• Mining Signals from Check-ins
• Data Pipeline
• Data Infrastructure
  • Past
  • Present
  • Future

Overview

Page 52: Hadoop @ Foursquare

Future Work

HCatalog
– Makes Hive tables (including input formats and serdes) available to Pig and MapReduce
– Add support for Scoobi

Indexing/Hive-indexing

Relational / MPP database for analytics dashboarding

Key-value store for easily serving hadoop data in prod.

Replacing Flume 0.9.4

Page 53: Hadoop @ Foursquare

Open Source

Let us know what you might find useful

Page 54: Hadoop @ Foursquare

Join us!

foursquare is hiring! 115+ people and growing

foursquare.com/jobs

Joe Crobak, Software Engineer, @joecrobak

Blake Shaw, PhD, Data Scientist, @metablake