Hadoop and Pig at Twitter (OSCON 2010)


DESCRIPTION

A look into how Twitter uses Hadoop and Pig to build products and do analytics.

TRANSCRIPT

1. Hadoop and Pig @Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter
Friday, July 23, 2010

2. Agenda
Hadoop Overview
Pig: Rapid Learning Over Big Data
Data-Driven Products
Hadoop/Pig and Analytics

3. My Background
Mathematics and Physics at Harvard, Physics at Stanford
Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
Cooliris (web media): Hadoop and Pig for analytics, TBs of data
Twitter: Hadoop, Pig, HBase, Cassandra, machine learning, visualization, social graph analysis, soon to be PBs of data

4. Agenda
Hadoop Overview
Pig: Rapid Learning Over Big Data
Data-Driven Products
Hadoop/Pig and Analytics

5. Data is Getting Big
NYSE: 1 TB/day
Facebook: 20+ TB compressed/day
CERN/LHC: 40 TB/day (15 PB/year)
And growth is accelerating
Need multiple machines, horizontal scalability

6. Hadoop
Distributed file system (hard to store a PB)
Fault-tolerant; handles replication, node failure, etc.
MapReduce-based parallel computation (even harder to process a PB)
Generic key-value based computation interface allows for wide applicability

7. Hadoop
Open source: top-level Apache project
Scalable: Y! has a 4000-node cluster
Powerful: sorted a TB of random integers in 62 seconds
Easy packaging: Cloudera RPMs, DEBs

8-14. MapReduce Workflow (identical animation-build slides)
(diagram: Inputs -> Map -> Shuffle/Sort -> Reduce -> Outputs)
Challenge: how many tweets per user, given the tweets table?
Input: key=row, value=tweet info
Map: output key=user_id, value=1
Shuffle: sort by user_id
Reduce: for each user_id, sum
Output: user_id, tweet count
With 2x machines, runs 2x faster

15. But...
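The tweets-per-user dataflow from the MapReduce Workflow slide can be sketched in a single Python process. This is a hedged, toy illustration of the map/shuffle/reduce steps, not Hadoop code; the tweet rows are invented for the example.

```python
from collections import defaultdict

# Toy "tweets table": each row carries a user_id plus tweet info (invented data).
tweets = [
    {"user_id": "alice", "text": "hello"},
    {"user_id": "bob", "text": "hi"},
    {"user_id": "alice", "text": "again"},
]

# Map: for each input row, emit the pair (user_id, 1).
mapped = [(row["user_id"], 1) for row in tweets]

# Shuffle: group the emitted pairs by key (Hadoop does this by sorting on user_id).
groups = defaultdict(list)
for user_id, one in sorted(mapped):
    groups[user_id].append(one)

# Reduce: for each user_id, sum the ones to get the tweet count.
counts = {user_id: sum(ones) for user_id, ones in groups.items()}
print(counts)  # {'alice': 2, 'bob': 1}
```

On a real cluster the shuffle is what makes this scale: each reducer receives every value for the keys it owns, so adding machines splits both the map and reduce work, which is the "2x machines, 2x faster" point on the slide.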
Analysis typically in Java
Single-input, two-stage data flow is rigid
Projections, filters: custom code
Joins are lengthy, error-prone
Hard to manage n-stage jobs
Exploration requires compilation!

16. Agenda
Hadoop Overview
Pig: Rapid Learning Over Big Data
Data-Driven Products
Hadoop/Pig and Analytics

17. Enter Pig
High-level language
Transformations on sets of records
Process data one step at a time
Easier than SQL?
Top-level Apache project

18. Why Pig?
Because I bet you can read the following script.

19. A Real Pig Script

20. Now, just for fun...
The same calculation in vanilla MapReduce

21. No, seriously.

22. Pig Democratizes Large-Scale Data Analysis
The Pig version is:
5% of the code
5% of the development time
Within 25% of the execution time
Readable, reusable

23. One Thing I've Learned
It's easy to answer questions
It's hard to ask the right questions
Value the system that promotes innovation and iteration

24. Agenda
Hadoop Overview
Pig: Rapid Learning Over Big Data
Data-Driven Products
Hadoop/Pig and Analytics

25. MySQL, MySQL, MySQL
We all start there. But MySQL is not built for analysis.
select count(*) from users? Maybe.
select count(*) from tweets? Uh...
Imagine joining them. And grouping. Then sorting.

26. Non-Pig Hadoop at Twitter
Data sink via Scribe
Distributed grep
A few performance-critical, simple jobs
People Search

27. People Search?
First real product built with Hadoop
Find People
Old version: offline process on a single node
New version: complex graph calculations, hit internal network services, custom indexing
Faster, more reliable, more observable
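The People Search slides mention building a trie with case folding/normalization for name suggestions. A minimal sketch of that general idea follows; the NameTrie class and the example names are invented for illustration, and the real system is a sharded, in-memory Scala service, not this.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.names = []     # original-case names reachable under this prefix


class NameTrie:
    """Prefix index: keys are case-folded, original names kept for display."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, name):
        node = self.root
        for ch in name.casefold():  # case folding/normalization of the key
            node = node.children.setdefault(ch, TrieNode())
            node.names.append(name)  # every prefix node can suggest this name

    def suggest(self, prefix, limit=5):
        node = self.root
        for ch in prefix.casefold():
            if ch not in node.children:
                return []
            node = node.children[ch]
        return node.names[:limit]


trie = NameTrie()
for name in ["Kevin", "kevlar", "Karen"]:  # invented example names
    trie.insert(name)
print(trie.suggest("KE"))  # ['Kevin', 'kevlar']
```

Storing candidate names at every prefix node trades memory for constant-time lookup per typed character, which fits the low-latency, in-memory serving layer the slides describe.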
28. People Search
Import user data into HBase
Periodic MapReduce job reading from HBase
Hits FlockDB and other internal services in the mapper
Custom partitioning
Data sucked across to a sharded, replicated, horizontally scalable, in-memory, low-latency Scala service
Build a trie; do case folding/normalization, suggestions, etc.

29. Agenda
Hadoop Overview
Pig: Rapid Learning Over Big Data
Data-Driven Products
Hadoop/Pig and Analytics

30. Order of Operations
Counting
Correlating
Research/Algorithmic Learning

31. Counting
How many requests per day?
What's the average latency? 95th-percentile latency?
What's the response code distribution?
How many searches per day? Unique users?
What's the geographic breakdown of requests?
How many tweets? From what clients?
How many signups? Profile completeness?
How many SMS notifications did we send?

32. Correlating
How does usage differ for mobile users?
...for desktop client users (TweetDeck, etc.)?
Cohort analyses
What services fail at the same time?
What features get users hooked?
What do successful users do often?
How does tweet volume change over time?

33. Research
What can we infer from a user's tweets?
...from the tweets of their followers? Followees?
What features tend to get a tweet retweeted?
...and what influences the retweet tree depth?
Duplicate detection, language detection
What graph structures lead to increased usage?
Sentiment analysis, entity extraction
User reputation

34. If We Had More Time...
HBase
LZO compression and Hadoop
Protocol Buffers
Our open source: hadoop-lzo, elephant-bird
Analytics and Cassandra

35. Questions?
Follow me at twitter.com/kevinweil
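As a footnote to the Counting slide: once request logs land in one place, questions like average and 95th-percentile latency reduce to simple aggregations. A minimal sketch over invented latency values, using the nearest-rank percentile definition (an assumption; the talk does not say which definition Twitter used):

```python
import math

# Invented request latencies in milliseconds.
latencies_ms = [12, 15, 9, 220, 18, 14, 11, 13, 17, 400]

# Average latency.
avg = sum(latencies_ms) / len(latencies_ms)

# Nearest-rank 95th percentile: the value at rank ceil(0.95 * n) in sorted order.
ranked = sorted(latencies_ms)
p95 = ranked[math.ceil(0.95 * len(ranked)) - 1]

print(avg, p95)  # 72.9 400
```

The contrast with the mean (72.9 ms vs. a 400 ms p95 here) is exactly why the slide asks for the 95th percentile and not just the average.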