2013
Building and Improving Products with Hadoop
Matthew Rathbone
What is Foursquare
Foursquare helps you explore the world around you.
Meet up with friends, discover new places, and save money using your phone.
• 4 billion check-ins
• 35 million users
• 50 million points of interest (POI)
• 150 employees
• 1 TB+ of data a day
FIRST, A STORY
http://www.flickr.com/photos/shannonpatrick17
The Right Tool for the Job
• Nginx – Serving static files
• Perl – Regular expressions
• XML – Frustrating people
• Hadoop (Map Reduce) – Counting
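To make "counting" concrete, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It runs in a single process and does not use Hadoop; the function names and the sample tips are made up for illustration and are not from the talk.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for a single key."""
    return key, sum(values)


def count_words(records):
    """Run map, shuffle, and reduce over an iterable of text records."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample tips, not real Foursquare data.
tips = ["great coffee", "coffee and bagels", "great bagels"]
counts = count_words(tips)
```

In a real cluster the map and reduce functions would run on many machines and the framework would handle the grouping step; the shape of the computation stays the same.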
COUNTING – WHAT IS IT GOOD FOR
http://www.flickr.com/photos/blaahhi/
Statistically Improbable Phrases
SIPS use cases
• menu extraction
• sentiment analysis
• venue ratings
• specific recommendations
• search indexing
• pricing data
• facility information
How is SIPS built?
Basically lots of counting.
SIPS
• Tokenize data with a language model (into N-Grams)
  • built using tips, shouts, menu items, likes, etc
• Apply a TF-IDF algorithm (term frequency, inverse document frequency)
  • Global phrase count
  • Local phrase count (in a venue)
  • Some filtering and ranking
• Re-compute & deploy nightly
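The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Foursquare pipeline: whitespace splitting stands in for the language model, each venue is treated as one "document", the global count is taken as the number of venues that contain a phrase, and the names `ngrams`, `score_phrases`, and the sample venues are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, max_n=2):
    """Yield every 1..max_n word phrase in the text (naive whitespace tokenizer)."""
    tokens = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def score_phrases(tips_by_venue, max_n=2):
    """Return {venue: {phrase: tf-idf score}} treating each venue as a document."""
    # Local phrase counts: how often each phrase appears in one venue's tips.
    local = {
        venue: Counter(p for tip in tips for p in ngrams(tip, max_n))
        for venue, tips in tips_by_venue.items()
    }
    # Global counts: in how many venues each phrase appears at least once.
    venue_freq = Counter(p for counts in local.values() for p in counts)
    n_venues = len(local)

    scores = {}
    for venue, counts in local.items():
        total = sum(counts.values())
        scores[venue] = {
            phrase: (count / total) * math.log(n_venues / venue_freq[phrase])
            for phrase, count in counts.items()
        }
    return scores


# Hypothetical sample data.
tips_by_venue = {
    "pizzeria": ["great pizza", "pizza was great"],
    "cafe": ["great coffee"],
}
scores = score_phrases(tips_by_venue)
```

A phrase that shows up at every venue ("great") scores zero, while a phrase concentrated in one venue ("pizza") scores higher, which is the sense in which a phrase is statistically improbable.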
WHY USE HADOOP?
http://www.flickr.com/photos/dbrekke/
SIPS – Without Hadoop
Potential Problems
• Database Query Throttling
• Venues are out of sync
• Altering the algorithm could take forever to populate for all venues
• Where would you store the results?
• What about debug data?
• Does it scale to 10x, 100x?
• What about other, similar workflows?
SIPS – Hadoop Benefits
• Quick Deployment
• Modular & Reusable
• Arbitrarily complex combination of many datasets
• Every step of the workflow creates value
Apple Store - Downtown San Francisco
1 tip mentions "haircuts"
Search for "haircuts" in "san francisco" returns the Apple Store???
Fixed by looking at % of tips and overall frequency
“Hey Apple, how bout less shiny pizzazz and fancy haircuts and more fix-my-f!@#$-imac”
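A rough sketch of that kind of fix: keep a phrase for a venue only when enough of the venue's tips mention it. The thresholds, the function name `keep_phrase`, and the sample tips are assumptions for illustration, and "overall frequency" is read here as the number of mentions at the venue, which is only one possible interpretation of the slide.

```python
def keep_phrase(phrase, venue_tips, min_share=0.05, min_mentions=2):
    """Return True if the phrase is common enough at this venue to index.

    min_share and min_mentions are hypothetical thresholds.
    """
    if not venue_tips:
        return False
    # Number of tips at this venue that mention the phrase at all.
    mentions = sum(1 for tip in venue_tips if phrase in tip.lower())
    # Fraction of the venue's tips that mention the phrase.
    share = mentions / len(venue_tips)
    return share >= min_share and mentions >= min_mentions


# Made-up tips: one stray mention among many unrelated ones.
apple_tips = ["less shiny pizzazz and fancy haircuts please"] + [
    "genius bar fixed my laptop"
] * 99
# Made-up tips for a venue where the phrase is actually relevant.
barber_tips = ["best haircuts in town", "quick haircuts", "friendly staff"]

apple_ok = keep_phrase("haircuts", apple_tips)
barber_ok = keep_phrase("haircuts", barber_tips)
```

With these sample tips the single stray mention is dropped for the first venue and the phrase is kept for the second.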
Data & Modularity
ACTUALLY, IT’S A BIT MORE COMPLICATED
http://www.flickr.com/photos/bfishadow
These benefits require infrastructure
Dependency Management
Many options
• Oozie (Apache)
• Azkaban (LinkedIn)
• Luigi (Spotify, we <3 this)
• Hamake (Codeminders)
• Chronos (Airbnb)
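What these tools have in common is working out which jobs must finish before others can start. The sketch below shows only that core idea using Python's standard `graphlib` module (Python 3.9+); it does not use or mirror the API of any tool listed above, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
jobs = {
    "ingest_tips": set(),
    "tokenize": {"ingest_tips"},
    "global_counts": {"tokenize"},
    "local_counts": {"tokenize"},
    "score_phrases": {"global_counts", "local_counts"},
    "deploy": {"score_phrases"},
}


def run_order(dependencies):
    """Return one valid execution order in which every job follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(jobs)
```

Real workflow managers add scheduling, retries, and checks for whether a job's output already exists, but the dependency graph is the part every option shares.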
Database / Log Ingestion
• Sqoop
• Mongo-Hadoop
• Kafka
• Flume
• Scribe
• etc
MapReduce Friendly Datastore
A few obvious ones:
• HBase
• Cassandra
• Voldemort
We built our own; it's very similar to Voldemort and uses the HFile API.
Getting started without all that stuff
Components you likely don’t have
The best way to start
Don’t use Hadoop.
*but pretend you do
Other reasons to not use Hadoop
• Your idea might not be very good
• Hadoop will slow you down to start with
• You don’t have enough infrastructure yet
  • build it when you need it
• V1 might not be that complex
• V1 could be a spreadsheet
SIPS
Version 1
• Off-the-shelf language model
• A subset of Venues & Tips
• Did not use Map Reduce
• Did not push to production at all
SIPS
Version 2
• Started building our own language model
• Rewritten as a Map Reduce job
• Manually loaded data to production
• Filters for English data only
Tweak, improve, etc
SIPS
Version 3
• Incorporated more data sources into our language model
• Deployment to KV store (auto)
• Incorporated lots of debug output
• Language pipeline also feeds sentiment analysis
Now we’re in the perfect place to iterate & improve
…to explore data
In Summary
• Hadoop is good for counting, so use it for counting
• Move quickly whenever possible and don’t worry about automation
• Bring in new production services as you need them
• Freedom!
[email protected]
@rathboma
Bonus: http://hadoopweekly.com from my colleague, Joe Crobak (presenting later!)