Hadoop @ Foursquare
Joe Crobak, Software Engineer, @joecrobak
Blake Shaw, PhD, Data Scientist, @metablake
An app that helps you explore your city and connect with friends
A platform for location-based services and data
What is Foursquare?
People use foursquare to:
– share with friends
– discover new places
– get tips
– get deals
– earn points and badges
– keep track of visits
What is Foursquare?
Mobile Social
Local
What is Foursquare?
20,000,000+ people
30,000,000+ places
2,000,000,000+ check-ins
1500+ actions/second
Stats
Video: http://vimeo.com/29323612
• Intro to Foursquare Data
• Mining Signals from Check-ins
• Data Pipeline
• Data Infrastructure
  • Past
  • Present
  • Future
Overview
Explore
A social recommendation engine built from check-in data
Central Park JFK
What is a place?
Time signatures for places
[Figure: two monthly series for 2011 in New York, January through the following January: temperature (°F) and the percentage of check-ins at ice cream shops]
Ice cream?
Warm weather spots
– ice cream shops
– roof decks
– boats or ferries
– harbors or marinas
– sculpture gardens
– tracks
– basketball courts
– parks
Cold weather spots
– lakes
– basketball stadiums
– hockey stadiums
– art galleries
– skating rinks
– boutiques
– steakhouses
– ramen or noodle house
Check-ins and the weather
• Critical for our recommendation engine
• Large sparse k-nearest neighbor problem
– Items can be places, people, brands
– Different distance metrics
– Need to exploit sparsity, otherwise intractable
Finding similar items
• Metrics we find work best for recommending:
– Places: cosine similarity
$\mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \|x_j\|}$
– Friends: intersection
$\mathrm{sim}(A, B) = |A \cap B|$
– Brands: Jaccard similarity
$\mathrm{sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$
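The three metrics can be sketched in a few lines of Python. This is a minimal illustration, not Foursquare's code: it assumes sparse vectors are stored as dicts from index to value and that friend or brand audiences are plain sets, and all function names are made up for the example.

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    # Iterate over the smaller dict so the dot product only touches shared work.
    if len(x) > len(y):
        x, y = y, x
    dot = sum(value * y[key] for key, value in x.items() if key in y)
    norm_x = math.sqrt(sum(value * value for value in x.values()))
    norm_y = math.sqrt(sum(value * value for value in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_x * norm_y)

def intersection_similarity(a, b):
    """Size of the overlap between two sets, |A ∩ B|."""
    return len(a & b)

def jaccard_similarity(a, b):
    """Overlap relative to the union, |A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0
```

Only the non-zero entries are ever visited, which is the sparsity the slides say must be exploited.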
Finding similar items
$X \in \mathbb{R}^{n \times d}$
each entry is the log(# of checkins at place i by user j)
one row for each of the 30M venues...
$K \in \mathbb{R}^{n \times n}$
$K_{ij} = \mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \|x_j\|}$
Computing venue similarity
$K \in \mathbb{R}^{n \times n}$
$K_{ij} = \mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \|x_j\|}$
• Naive solution for computing $K$: $O(n^2 d)$
• Requires ~4.5M machines to compute in < 24 hours, and 3.6 PB to store!
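The storage figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes a dense n × n matrix over about 30 million venues with 4 bytes per entry; the entry size is an assumption, since the slides do not state it. The machine count is not reproduced here because it depends on d and on per-machine throughput, neither of which the slides give.

```python
def dense_similarity_storage_pb(num_venues, bytes_per_entry=4):
    """Petabytes needed to store a dense num_venues x num_venues matrix."""
    entries = num_venues * num_venues          # one entry per venue pair
    return entries * bytes_per_entry / 1e15    # 1 PB = 10**15 bytes

# 30 million venues, 4-byte entries (assumed): 9e14 entries, 3.6e15 bytes
storage_pb = dense_similarity_storage_pb(30_000_000)
```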
Computing venue similarity
Venue similarity w/ map reduce
[Diagram: user → visited venues → map → (key: v_i, v_j; score) → reduce → final score]
map: emit “all” pairs of visited venues for each user
reduce: sum up each user’s score contribution to this pair of venues
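The map and reduce steps can be simulated in memory. This is a hedged sketch rather than the production job: it assumes each user's visits arrive as a dict of venue to weight (standing in for the log-scaled check-in counts), that venue norms are computed in a separate first pass, and that every pair is emitted, whereas the quoted “all” on the slide suggests the real job prunes some pairs. All names are illustrative.

```python
import math
from collections import defaultdict
from itertools import combinations

def venue_norms(user_visits):
    """First pass: Euclidean norm of each venue's column of user weights."""
    squares = defaultdict(float)
    for visits in user_visits.values():
        for venue, weight in visits.items():
            squares[venue] += weight * weight
    return {venue: math.sqrt(total) for venue, total in squares.items()}

def map_user(visits, norms):
    """Map: emit ((v_i, v_j), score contribution) for each pair one user visited."""
    for vi, vj in combinations(sorted(visits), 2):
        yield (vi, vj), visits[vi] * visits[vj] / (norms[vi] * norms[vj])

def reduce_scores(emitted):
    """Reduce: sum every user's contribution for each venue pair."""
    totals = defaultdict(float)
    for key, score in emitted:
        totals[key] += score
    return dict(totals)

def venue_similarity(user_visits):
    norms = venue_norms(user_visits)
    emitted = (pair for visits in user_visits.values()
               for pair in map_user(visits, norms))
    return reduce_scores(emitted)
```

Pairs of venues that no user has in common are never emitted, so the output holds only the non-zero entries of K instead of all n² of them.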
• Intro to Foursquare Data
• Mining Signals from Check-ins
• Data Pipeline
• Data Infrastructure
  • Past
  • Present
  • Future
Overview
Data pipeline - stats
1,500,000,000+ log events / week
2,500,000,000,000+ bytes / week
[Chart: GB / day (compressed) for the api collection, May 2011 to May 2012]
Data pipeline - stats
And lots of people are using it!
– 100+ hive users.
– several users with > 100 jobs.
– ~700 MR jobs / day.
Data pipeline - stats
Data pipeline - background
Foursquare’s technology stack
– Amazon EC2
– MongoDB
– Solr / elasticsearch
– Scala
• Lift web framework
– Flume (0.9.x aka old-gen)
– Amazon S3
Data pipeline - overview
[Diagram: API / WWW → (JSON) → Flume Collector → S3 (.../collection-name/dt=2012-06-19/...) → Hive, MapReduce; mongodb → Export Process → S3]
Data pipeline - logs
Applications log JSON
– some common fields (e.g. event id, timestamp, host)
– data is partitioned by collection and date in S3.
– one table per collection in Hive.
Flume for data transport.
[Diagram: API / WWW → (JSON) → Flume Collector → S3 (.../collection-name/dt=2012-06-19/...)]
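The partitioning by collection and date can be illustrated with a short sketch. It assumes each JSON event carries a `timestamp` field in epoch milliseconds and that dates are taken in UTC; both are assumptions, since the slides only say that a timestamp is one of the common fields. The bucket prefix elided in the path above is left out, and the function names are hypothetical.

```python
import json
from datetime import datetime, timezone

def partition_path(collection, timestamp_ms):
    """Relative key prefix for one event: <collection>/dt=<YYYY-MM-DD>/."""
    day = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{collection}/dt={day.strftime('%Y-%m-%d')}/"

def route_event(collection, line):
    """Parse one JSON log line and return the partition it belongs to."""
    event = json.loads(line)
    return partition_path(collection, event["timestamp"])  # assumed field name
```

Because the date appears in the key as `dt=...`, a Hive table over the collection can treat it as a partition column and scan only the days a query asks for.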
Data pipeline - mongodb
Mongo data is nice to work with in MapReduce
– info in logs can be stale.
– certain attributes not in logs.
– can scan much less data.
MapReduce against production mongo cluster would degrade performance and/or cause denial-of-service.
[Diagram: mongodb → Export Process → S3]
Data pipeline - analytics
Automated reporting
– typically a Hive query -> google docs spreadsheet.
Ad hoc reporting
– hive dashboard for entering query and receiving an email when results are ready.
– RoR (Ruby on Rails)
Data pipeline - beekeeper
Data pipeline - Summary
Log data and snapshots of mongo data are stored in S3.
Users query/analyze the data using Hive, Pig, and MapReduce.
Compiled data is inserted into mongo or google spreadsheets for reporting.
• Intro to Foursquare Data
• Mining Signals from Check-ins
• Data Pipeline
• Data Infrastructure
  • Past
  • Present
  • Future
Overview
Data infrastructure - past
Elastic MapReduce
– but we were keeping clusters continuously running.
Rudimentary workflow management
– start daily reporting at time X. Hope that data is there.
– difficult to monitor.
Data infrastructure - past
Scaling the number of users was troublesome
– most of the company uses hive.
– hive server continuously crashed.
– lots of memory issues.
– resource contention.
Mongo data
– converted to delimited records, which doesn’t always make sense.
– incremental dumps - some data not consistent (e.g. if two venues are merged).
– basic schema detection.
– single threaded per-collection.
Data infrastructure - past
Hive and EMR “flows” supported for automated reporting.
Lots of mapreduce tools written in ruby
– everything else is scala
• want to use common utilities
– installing gems on system is brittle
• Intro to Foursquare Data
• Mining Signals from Check-ins
• Data Pipeline
• Data Infrastructure
  • Past
  • Present
  • Future
Overview
Data infrastructure - present
Introduced a lot of new systems
– Cloudera’s Hadoop Distro - CDH3u3
– Oozie for workflow / data management
– Pig for reporting
– Scaled back ruby / hive dashboard
– BSON mongo dumps
– Scala MapReduce
– Scoobi
Data infrastructure - CDH3u3
12-node cluster in EC2 on cc2.8xlarge instances
– data is in S3
– fair scheduler (jobs run as submitting user)
– performance improvements
• skew means slowest reducer often defines wall-clock time
– signs of virtualization
– cpu bound (data compression)
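The point about skew can be made concrete with a small simulation. This is only a sketch: it assigns keys to reducers by CRC32 modulo the reducer count, which stands in for Hadoop's default hash partitioning, and it treats records per reducer as a proxy for running time. Both simplifications are assumptions.

```python
import zlib

def reducer_loads(key_counts, num_reducers):
    """Records handled by each reducer when keys are hash-partitioned."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        # Every record for a given key lands on the same reducer.
        reducer = zlib.crc32(key.encode("utf-8")) % num_reducers
        loads[reducer] += count
    return loads

def wall_clock_proxy(key_counts, num_reducers):
    """The job finishes only when its most loaded reducer does."""
    return max(reducer_loads(key_counts, num_reducers))
```

With one hot key holding most of the records, adding reducers barely lowers `wall_clock_proxy`, because that key cannot be split across them.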
Data infrastructure - oozie
Pros
– better monitoring (though not perfect).
– coordinators for dataset management are great.
– oozie distributes job submission via map tasks.
– SPoF but recovers after a restart (state stored in DB).
Cons
– deployment is not ideal, it’s difficult to version workflows.
– configuration via XML - lots of boilerplate
• we have a scaffolding script to bootstrap a workflow.
Oozie coordinator (the good)
[Diagram: the Coordinator's Workflow F depends on a Dataset Instance of Dataset A. Coordinator to S3 / HDFS: "Does data exist yet?" "Yes? Kickoff workflow!"]
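The coordinator's behavior amounts to "wait until the dataset instance exists, then start the workflow". The sketch below shows that idea in plain Python. It is not Oozie's API, which expresses this in coordinator XML. A local path check stands in for S3 / HDFS, and the function name, parameters, and polling interval are all hypothetical.

```python
import os
import time

def run_when_available(dataset_path, workflow, poll_seconds=60, max_polls=None,
                       exists=os.path.exists, sleep=time.sleep):
    """Poll for a dataset instance; run the workflow once it is present.

    Returns the workflow's result, or None if max_polls checks all came up empty.
    """
    polls = 0
    while not exists(dataset_path):          # "Does data exist yet?"
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return None                      # give up instead of waiting forever
        sleep(poll_seconds)
    return workflow(dataset_path)            # "Yes? Kickoff workflow!"
```

Compared with starting a report at a fixed time and hoping the data is there, the dependency is explicit: the workflow cannot start before its input exists.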
Oozie XML (the bad)
Hello World in Oozie
– just invoking HelloWorld#main
Pig for reporting
Converted some ruby streaming to Pig + Scala UDFs.
More natural than Hive for some reports, especially those that output to multiple locations.
Elephant-Bird (Twitter), Piggybank (Apache), and DataFu (LinkedIn) are all great UDF resources.
Ad hoc reporting dashboard
Uses hive thrift server to validate syntax (via EXPLAIN)
Submits jobs as Oozie workflows.
– The query is a parameter to the workflow.
– queries run as the users that submit them.
[Diagram: Beekeeper takes $QUERY and talks to the Hive Thrift Server and to Oozie REST]
1. EXPLAIN $QUERY (Thrift Server)
2. submit workflow, query=$QUERY (Oozie REST)
3. (repeatedly) is workflow done? (Oozie REST)
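The three steps can be sketched as one function. This is an illustration of the control flow only, not Beekeeper's code: the Thrift and Oozie REST clients are passed in as hypothetical callables (`explain`, `submit_workflow`, `workflow_status`), and the terminal status names are assumed to match Oozie's workflow job states.

```python
import time

TERMINAL_STATUSES = ("SUCCEEDED", "FAILED", "KILLED")

def run_query(query, explain, submit_workflow, workflow_status,
              poll_seconds=5, sleep=time.sleep):
    """Validate a query, submit it as a workflow, and wait for it to finish."""
    # 1. EXPLAIN $QUERY: expected to raise if the syntax is invalid,
    #    so bad queries never reach the cluster.
    explain(query)
    # 2. Submit the workflow with the query as a parameter.
    job_id = submit_workflow({"query": query})
    # 3. Repeatedly ask whether the workflow is done.
    while True:
        status = workflow_status(job_id)
        if status in TERMINAL_STATUSES:
            return job_id, status
        sleep(poll_seconds)
```

Validating with EXPLAIN first gives the user a syntax error immediately instead of after a queued job fails.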
Hive dashboard - error
BSON data dumps
Full loads each day, parallelized.
mongodb’s native format is BSON.
– Binary JSON
– some extensions to JSON
– schema-less
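To make "Binary JSON" concrete, here is a toy encoder for a tiny subset of BSON: documents whose values are UTF-8 strings or 32-bit integers. It is a sketch for illustration only; real dumps contain many more types, and a maintained BSON library should be used in practice. The function name is made up for the example.

```python
import struct

def encode_bson(doc):
    """Encode a flat dict of str/int32 values as a BSON document (toy subset)."""
    body = b""
    for key, value in doc.items():
        name = key.encode("utf-8") + b"\x00"           # element name as a C string
        if isinstance(value, bool):
            raise TypeError("booleans are outside this toy subset")
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # type 0x02: int32 byte length (including the NUL), then the bytes
            body += b"\x02" + name + struct.pack("<i", len(data)) + data
        elif isinstance(value, int):
            # type 0x10: little-endian int32
            body += b"\x10" + name + struct.pack("<i", value)
        else:
            raise TypeError(f"unsupported type: {type(value).__name__}")
    # Total length counts the 4-byte length prefix and the trailing NUL.
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"
```

Each element carries its own type tag and name, which is why a dump can be read without a schema, and the length prefixes let a reader skip from one document to the next.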
BSON data infrastructure
[Diagram: BSON → BSON Split / LZO compress → BSONObject → Thrift (scala codegen) → ThriftBsonInputFormat → Recordv2 SerDe / Scoobi Input]
EBS Snapshots
Mongo stores BSON on EBS.
– Periodic EBS snapshots
Oozie Workflow to mount snapshot, split data, compress, upload to S3.
BSON InputFormat converts to BSONObject
Scala Codegen converts to Thrift Object
InputFormat for Thrift objects to use in MR.
Hive SerDe and Scoobi Inputs
Scooby
Not that Scooby!
Scoobi
A strongly-typed data flow language written in Scala.
Much easier than writing MapReduce, but still very flexible.
https://github.com/nicta/scoobi
Scoobi Example
Counting checkins
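The Scoobi code from this slide did not survive in the transcript. As a stand-in, the sketch below counts check-ins per venue in plain Python using the same three stages a data flow would have: map each record to a (venue, 1) pair, group by venue, and sum each group. It is not Scoobi, and the tab-separated `user_id<TAB>venue_id` input format is an assumption.

```python
from collections import defaultdict

def to_pairs(lines):
    """Map: turn each 'user_id<TAB>venue_id' record into (venue_id, 1)."""
    for line in lines:
        if line.strip():
            _user_id, venue_id = line.rstrip("\n").split("\t")[:2]
            yield venue_id, 1

def group_by_key(pairs):
    """Group: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def count_checkins(lines):
    """Combine: sum each venue's ones into a check-in count."""
    return {venue: sum(ones) for venue, ones in group_by_key(to_pairs(lines)).items()}
```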
Data infrastructure - Data Joins
Joins in MapReduce are cumbersome.
– Do them once!
Data infrastructure - Data Joins
[Diagram: a single Data Join combining Venue, Checkins, Tips, and Likes]
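"Do them once" can be sketched as a single pass that attaches every check-in, tip, and like to its venue, so that later jobs read one joined record per venue instead of repeating the join. This is an in-memory illustration under assumptions: records are dicts, the join key is a `venue_id` field, and the function name is hypothetical.

```python
def join_by_venue(venues, checkins, tips, likes):
    """Build one joined record per venue from four separate datasets."""
    joined = {
        venue_id: {"venue": venue, "checkins": [], "tips": [], "likes": []}
        for venue_id, venue in venues.items()
    }
    for name, records in (("checkins", checkins), ("tips", tips), ("likes", likes)):
        for record in records:
            row = joined.get(record["venue_id"])   # assumed join key
            if row is not None:                    # skip records for unknown venues
                row[name].append(record)
    return joined
```

In MapReduce terms this corresponds to a reduce-side join keyed on the venue, with the joined output written back to storage for downstream jobs to reuse.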
• Intro to Foursquare Data
• Mining Signals from Check-ins
• Data Pipeline
• Data Infrastructure
  • Past
  • Present
  • Future
Overview
Future Work
HCatalog
– Makes Hive tables (including input formats and serdes) available to Pig and MapReduce
– Add support for Scoobi
Indexing/Hive-indexing
Relational / MPP database for analytics dashboarding
Key-value store for easily serving hadoop data in prod.
Replacing Flume 0.9.4
Open Source
Let us know what you might find useful
Join us!
foursquare is hiring! 115+ people and growing
foursquare.com/jobs
Joe Crobak, Software Engineer, @joecrobak
Blake Shaw, PhD, Data Scientist, @metablake