Hadoop @ Foursquare

Joe Crobak, Software Engineer, @joecrobak

Blake Shaw, PhD, Data Scientist, @metablake

An app that helps you explore your city and connect with friends

A platform for location-based services and data

What is Foursquare?

People use foursquare to:

– share with friends

– discover new places

– get tips

– get deals

– earn points and badges

– keep track of visits

What is Foursquare?

Mobile Social

Local

What is Foursquare?

20,000,000+ people

30,000,000+ places

2,000,000,000+ check-ins

1500+ actions/second

Stats

Video: http://vimeo.com/29323612

http://www.youtube.com/watch?v=jfj_nJ6pvFQ


• Intro to Foursquare Data

• Mining Signals from Check-ins

• Data Pipeline

• Data Infrastructure

• Past

• Present

• Future

Overview

Explore

A social recommendation engine built from check-in data

Central Park JFK

What is a place?

Time signatures for places

[Figure: temperature (°F) and % of check-ins at ice cream shops, by month in 2011 (New York)]

Ice cream?

Warm weather spots

– ice cream shops

– roof decks

– boats or ferries

– harbors or marinas

– sculpture gardens

– tracks

– basketball courts

– parks

Cold weather spots

– lakes

– basketball stadiums

– hockey stadiums

– art galleries

– skating rinks

– boutiques

– steakhouses

– ramen or noodle house

Check-ins and the weather

• Critical for our recommendation engine

• Large sparse k-nearest neighbor problem

– Items can be places, people, brands

– Different distance metrics

– Need to exploit sparsity, otherwise intractable

Finding similar items

• Metrics we find work best for recommending:

– Places: cosine similarity, $\mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$

– Friends: intersection, $\mathrm{sim}(A, B) = |A \cap B|$

– Brands: Jaccard similarity, $\mathrm{sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$

Finding similar items
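The three metrics named on this slide (cosine, intersection, Jaccard) can be sketched in a few lines of Python. This is an illustration only: the function names and the dict/set representations are assumptions, not Foursquare's code.

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity of two sparse vectors stored as {key: weight} dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(w * y[k] for k, w in x.items() if k in y)
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

def intersection_similarity(a, b):
    """Size of the overlap between two collections, |A ∩ B|."""
    return len(set(a) & set(b))

def jaccard_similarity(a, b):
    """Overlap divided by the size of the union, |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Storing vectors as dicts keeps the work proportional to the number of non-zero entries, which is the sparsity the slide refers to.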

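$X \in \mathbb{R}^{n \times d}$

each entry $X_{ij}$ is the log(# of check-ins at place i by user j)

one row for each of the 30m venues...

$K \in \mathbb{R}^{n \times n}$

$K_{ij} = \mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$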

Computing venue similarity

$K \in \mathbb{R}^{n \times n}$

$K_{ij} = \mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$

• Naive solution for computing $K$: $O(n^2 d)$

• Requires ~4.5m machines to compute in < 24 hours!!! and 3.6PB to store!

Computing venue similarity
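The 3.6PB storage figure is consistent with a dense n × n matrix over 30m venues at 4 bytes per entry. The 4-byte entry size is an assumption (the slide does not state it), so treat this as one plausible reading of the number rather than the authors' calculation.

```python
def dense_matrix_petabytes(n, bytes_per_entry=4):
    """Storage for a dense n x n matrix, in petabytes (1 PB = 1e15 bytes)."""
    # bytes_per_entry=4 assumes 32-bit floats; this is not stated on the slide.
    return n * n * bytes_per_entry / 1e15

venues = 30_000_000  # "30m venues" from the slide
print(f"{dense_matrix_petabytes(venues):.1f} PB")
```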

Venue similarity w/ map reduce

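[Diagram: MapReduce flow for venue similarity]

map: input is a user and their visited venues; emit "all" pairs of visited venues for each user, with key (vi, vj) and a score

reduce: input is a key (vi, vj) and its scores (score, score, score, ...); sum up each user's score contribution to this pair of venues to get the final score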
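A single-process Python sketch of this map and reduce shape is given here. It assumes the per-user score contribution is the product of that user's weights at the two venues, and that the summed score is then divided by the venue norms to give cosine similarity. The slides do not spell out either detail, and all names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

def map_user(visits):
    """Map step for ONE user. visits is {venue: weight}.

    Emits ((vi, vj), partial_score) for every pair of venues the user visited.
    """
    for vi, vj in combinations(sorted(visits), 2):
        yield (vi, vj), visits[vi] * visits[vj]

def reduce_pairs(emitted):
    """Reduce step: sum every user's contribution for each venue pair."""
    totals = defaultdict(float)
    for key, score in emitted:
        totals[key] += score
    return dict(totals)

def venue_similarity(user_visits):
    """user_visits is {user: {venue: weight}}; returns {(vi, vj): cosine}."""
    squared_norms = defaultdict(float)
    emitted = []
    for visits in user_visits.values():
        for venue, weight in visits.items():
            squared_norms[venue] += weight * weight
        emitted.extend(map_user(visits))
    dots = reduce_pairs(emitted)
    # Assumption: normalize the summed score by the two venue norms.
    return {
        (vi, vj): dot / math.sqrt(squared_norms[vi] * squared_norms[vj])
        for (vi, vj), dot in dots.items()
    }
```

Only venue pairs that share at least one visitor are ever emitted, so the cost follows the number of co-visits rather than n².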




Data pipeline - stats

1,500,000,000+ log events / week

2,500,000,000,000+ bytes / week

[Chart: GB / day (compressed) for api collection, May 2011 to May 2012]

Data pipeline - stats

And lots of people are using it!

– 100+ Hive users.

– several users with > 100 jobs.

– ~ 700 MR jobs / day.

Data pipeline - stats

Data pipeline - background

Foursquare's technology stack

– Amazon EC2

– MongoDB

– Solr / elasticsearch

– Scala

• Lift web framework

– Flume (0.9.x aka old-gen)

– Amazon S3

Data pipeline - overview

[Diagram: API / WWW -> Flume Collector (JSON) -> S3 (.../collection-name/dt=2012-06-19/...) -> Hive, MapReduce; mongodb -> Export Process -> S3]

Data pipeline - logs

Applications log JSON

– some common fields (e.g. event id, timestamp, host)

– data is partitioned by collection and date in S3.

– one table per collection in Hive.

Flume for data transport.

[Diagram: API / WWW -> Flume Collector (JSON) -> S3 (.../collection-name/dt=2012-06-19/...)]
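The partitioning scheme (collection, then `dt=` date) can be sketched in Python. The field names `event_id`, `timestamp`, and `host`, the millisecond timestamps, and the use of UTC for the date are all assumptions. The leading part of the S3 path is elided on the slide and is left out here.

```python
import json
from datetime import datetime, timezone

def partition_path(collection, timestamp_ms):
    """Relative partition path for an event: <collection>/dt=<YYYY-MM-DD>/."""
    # Assumption: the partition date is the event's UTC calendar day.
    day = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{collection}/dt={day.strftime('%Y-%m-%d')}/"

def serialize_event(event_id, timestamp_ms, host, payload):
    """One JSON log line carrying the common fields plus event-specific data."""
    record = {"event_id": event_id, "timestamp": timestamp_ms, "host": host}
    record.update(payload)
    return json.dumps(record, sort_keys=True)
```

Because the date is part of the path, a Hive table over one collection can read a single day's directory instead of scanning the whole collection.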

Data pipeline - mongodb

Mongo data is nice to work with in MapReduce

– info in logs can be stale.

– certain attributes not in logs.

– can scan much less data.

MapReduce against production mongo cluster would degrade performance and/or cause denial-of-service.

[Diagram: mongodb -> Export Process -> S3]

Data pipeline - analytics

Automated reporting

– typically a Hive query -> Google Docs spreadsheet.

Ad hoc reporting

– Hive dashboard for entering a query and receiving an email when results are ready.

– RoR

Data pipeline - beekeeper

Data pipeline - Summary

Log data and snapshots of mongo data are stored in S3.

Users query/analyze the data using Hive, Pig, and MapReduce.

Compiled data is inserted into mongo or Google spreadsheets for reporting.




Data infrastructure - past

Elastic MapReduce

– but we were keeping clusters continuously running.

Rudimentary workflow management

– start daily reporting at time X. Hope that data is there.

– difficult to monitor.


Scaling the number of users was troublesome

– most of the company uses Hive.

– Hive server continuously crashed.

– lots of memory issues.

– resource contention.

Mongo data

– converted to delimited records, which doesn't always make sense.

– incremental dumps - some data not consistent (e.g. if two venues are merged).

– basic schema detection.

– single threaded per-collection.


Hive and EMR “flows” supported for automated reporting.

Lots of MapReduce tools written in Ruby

– everything else is Scala

• want to use common utilities

– installing gems on system is brittle




Data infrastructure - present

Introduced a lot of new systems

– Cloudera's Hadoop Distro - CDH3u3

– Oozie for workflow / data management

– Pig for reporting

– Scaled back ruby / hive dashboard

– BSON mongo dumps

– Scala MapReduce

– Scoobi


Data infrastructure - CDH3u3

12-node cluster in EC2 on cc2.8xlarge instances

– data is in S3

– fair scheduler (jobs run as submitting user)

– performance improvements

• skew means slowest reducer often defines wall-clock time

– signs of virtualization

– cpu bound (data compression)

Data infrastructure - oozie

Pros

– better monitoring (though not perfect).

– coordinators for dataset management are great.

– Oozie distributes job submission via map tasks.

– SPoF but recovers after a restart (state stored in DB).

Cons

– deployment is not ideal, it's difficult to version workflows.

– configuration via XML - lots of boilerplate

• we have a scaffolding script to bootstrap a workflow.

Oozie coordinator (the good)

[Diagram: a Coordinator for Workflow F, which depends on Dataset A, checks S3 / HDFS: does the dataset instance exist yet? Yes? Kick off the workflow!]
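The coordinator idea (check whether the dataset instance exists, and only then kick off the workflow) can be sketched in plain Python. This does not use Oozie. `dataset_exists` and `kickoff` are hypothetical callables supplied by the caller, and the polling interval is an arbitrary choice.

```python
import time

def run_when_ready(dataset_exists, kickoff, poll_seconds=60,
                   max_checks=None, sleep=time.sleep):
    """Poll until the dataset instance exists, then start the workflow once.

    Returns True if the workflow was started, False if max_checks ran out.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if dataset_exists():      # "Does data exist yet?"
            kickoff()             # "Yes? Kick off workflow!"
            return True
        sleep(poll_seconds)
    return False
```

Compared with starting a report at a fixed time and hoping the data is there, the workflow here only starts once its input is present.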

Oozie XML (the bad)

Hello World in Oozie

– just invoking HelloWorld#main
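The XML from the original slide is not reproduced in this text. To give a feel for the boilerplate, here is a hedged Python sketch that generates a minimal workflow with one Java action. The element names are intended to follow the Oozie workflow schema, but the 0.2 namespace, the action name `run-main`, and the `${jobTracker}` / `${nameNode}` parameters are assumptions. Verify against the Oozie version in use. This is not the scaffolding script mentioned earlier in the talk.

```python
import xml.etree.ElementTree as ET

OOZIE_NS = "uri:oozie:workflow:0.2"  # assumed schema version

def scaffold_java_workflow(name, main_class):
    """Return workflow XML that runs main_class#main and then ends."""
    app = ET.Element("workflow-app", {"xmlns": OOZIE_NS, "name": name})
    ET.SubElement(app, "start", {"to": "run-main"})

    action = ET.SubElement(app, "action", {"name": "run-main"})
    java = ET.SubElement(action, "java")
    # Placeholders resolved from the job properties at submission time.
    ET.SubElement(java, "job-tracker").text = "${jobTracker}"
    ET.SubElement(java, "name-node").text = "${nameNode}"
    ET.SubElement(java, "main-class").text = main_class
    ET.SubElement(action, "ok", {"to": "end"})
    ET.SubElement(action, "error", {"to": "fail"})

    kill = ET.SubElement(app, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Workflow failed"
    ET.SubElement(app, "end", {"name": "end"})
    return ET.tostring(app, encoding="unicode")

print(scaffold_java_workflow("hello-world", "HelloWorld"))
```

Even this smallest case needs a start node, an action with two transitions, a kill node, and an end node, which is the boilerplate a scaffold is meant to hide.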

Pig for reporting

Converted some ruby streaming to Pig + Scala UDFs.

More natural than Hive for some reports, especially those that output to multiple locations.

Elephant Bird (Twitter), Piggybank (Apache), and DataFu (LinkedIn) are all great UDF resources.

Ad hoc reporting dashboard

Uses hive thrift server to validate syntax (via EXPLAIN)

Submits jobs as Oozie workflows.

– The query is a parameter to the workflow.

– queries run as the users that submit them.


[Diagram: Beekeeper takes $QUERY and talks to the Thrift Server and Oozie REST]

1. EXPLAIN $QUERY (Thrift Server)

2. submit workflow, query=$QUERY (Oozie REST)

3. (repeatedly) is workflow done? (Oozie REST)
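The three steps can be sketched as one Python function. The `explain`, `submit_workflow`, and `is_done` callables stand in for the Thrift and Oozie REST clients. They are hypothetical placeholders, not real client APIs, and `explain` is assumed to raise on a syntax error.

```python
import time

def run_adhoc_query(query, explain, submit_workflow, is_done,
                    poll_seconds=5, sleep=time.sleep):
    """Validate a query, submit it as a workflow, and wait for completion."""
    # 1. EXPLAIN $QUERY: assumed to raise if the query does not parse.
    explain(query)
    # 2. Submit the workflow with the query as a parameter.
    job_id = submit_workflow({"query": query})
    # 3. Repeatedly ask whether the workflow is done.
    while not is_done(job_id):
        sleep(poll_seconds)
    return job_id
```

Validating first means a query with a syntax error fails immediately instead of occupying a workflow slot.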


Hive dashboard - error


BSON data dumps

Full loads each day, parallelized.

mongodb's native format is BSON.

– Binary JSON

– some extensions to JSON

– schema-less

BSON data infrastructure

[Diagram: BSON -> BSON Split / LZO compress -> BSONObject -> Thrift (scala codegen) -> ThriftBsonInputFormat -> Recordv2 SerDe / Scoobi Input]

EBS Snapshots

Mongo stores BSON on EBS.

– Periodic EBS Snapshots

Oozie Workflow to mount snapshot, split data, compress, upload to S3.

BSON InputFormat converts to BSONObject

Scala Codegen converts to Thrift Object

InputFormat for Thrift objects to use in MR.

Hive SerDe and Scoobi Inputs
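Splitting a BSON dump is possible because every BSON document starts with its own total length as a little-endian 32-bit integer. The Python sketch below walks a byte string document by document. It illustrates that framing only and is not the BSON InputFormat from the talk; the function name is hypothetical.

```python
import struct

def split_bson_documents(data):
    """Yield each raw BSON document found back to back in data (bytes)."""
    offset = 0
    while offset < len(data):
        if offset + 4 > len(data):
            raise ValueError("truncated length prefix")
        # The first 4 bytes give the document's total size, prefix included.
        (length,) = struct.unpack_from("<i", data, offset)
        # The smallest valid document is 5 bytes: length plus a trailing NUL.
        if length < 5 or offset + length > len(data):
            raise ValueError("truncated or corrupt BSON document")
        yield data[offset:offset + length]
        offset += length
```

Since boundaries can be found without decoding the fields, a dump can be cut into chunks at document boundaries and each chunk compressed and processed independently.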

Scooby

Not that Scooby!

Scoobi

A strongly-typed data flow language written in Scala.

Much easier than writing MapReduce, but still very flexible.

https://github.com/nicta/scoobi



Scoobi Example

Counting checkins
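The Scoobi code from the original slide is not recoverable from this text. As a stand-in, here is a plain-Python sketch of the same map / group-by-key / combine shape for counting check-ins per venue. It does not use the Scoobi API, and the `venue_id` field name is an assumption.

```python
from collections import defaultdict

def count_checkins(checkins):
    """Count check-ins per venue using explicit map, group, and combine steps."""
    # map: each check-in becomes a (venue_id, 1) pair
    pairs = ((checkin["venue_id"], 1) for checkin in checkins)
    # group by key: collect the 1s for each venue
    grouped = defaultdict(list)
    for venue_id, one in pairs:
        grouped[venue_id].append(one)
    # combine: add up the 1s
    return {venue_id: sum(ones) for venue_id, ones in grouped.items()}
```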

Data infrastructure - Data Joins

Joins in MapReduce are cumbersome.

– Do them once!

Data infrastructure - Data Joins

[Diagram: separate Checkins, Tips, and Likes records pass through a Data Join, producing one record per Venue with its Checkins, Tips, and Likes]
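The "do them once" join can be sketched in Python as a group-by on a shared key: every check-in, tip, and like is filed under its venue, giving one combined record per venue that later jobs can read without joining again. The `venue_id` field and the output layout are assumptions for illustration.

```python
from collections import defaultdict

def join_by_venue(checkins, tips, likes):
    """Group three record streams by venue_id into one record per venue."""
    joined = defaultdict(lambda: {"checkins": [], "tips": [], "likes": []})
    sources = (("checkins", checkins), ("tips", tips), ("likes", likes))
    for kind, records in sources:
        for record in records:
            # Same role as a reduce-side join: the venue id is the key.
            joined[record["venue_id"]][kind].append(record)
    return dict(joined)
```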




Future Work

HCatalog

– Makes Hive tables (including input formats and serdes) available to Pig and MapReduce

– Add support for Scoobi

Indexing/Hive-indexing

Relational / MPP database for analytics dashboarding

Key-value store for easily serving hadoop data in prod.

Replacing Flume 0.9.4

Open Source

Let us know what you might find useful

Join us!

foursquare is hiring! 115+ people and growing

foursquare.com/jobs

Joe Crobak, Software Engineer, @joecrobak

Blake Shaw, PhD, Data Scientist, @metablake
