Common and Unique Use Cases for Apache Hadoop
August 30, 2011
Agenda
• What is Apache Hadoop?
• Log Processing
• Catching 'Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
Copyright 2011 Cloudera Inc. All rights reserved
Exploding Data Volumes
• Online: web-ready devices, social media, digital content, smart grids
• Enterprise: transactions, R&D data, operational (control) data
(Chart: relational data vs. complex, unstructured data)
2,500 exabytes of new information in 2012, with the Internet as the primary driver. The digital universe grew by 62% last year to 800,000 petabytes and will grow to 1.2 zettabytes this year. Source: an IDC White Paper sponsored by EMC, "As the Economy Contracts, the Digital Universe Expands," May 2009.
• Timeline, 2002–2010:
• Open-source web crawler project (Nutch) created by Doug Cutting
• Google publishes the GFS and MapReduce papers
• Open-source MapReduce & HDFS project created by Doug Cutting
• Yahoo! runs a 4,000-node Hadoop cluster
• Hadoop wins the terabyte sort benchmark
• SQL support for Hadoop launches
• Cloudera releases CDH3 and Cloudera Enterprise
Origin of Hadoop: How does an elephant sneak up on you?
MapReduce
Hadoop Distributed File System (HDFS)
• Consolidates Everything: move complex and relational data into a single repository
• Stores Inexpensively: keep raw data always available; use commodity hardware
• Processes at the Source: eliminate ETL bottlenecks; mine data first, govern later
What is Apache Hadoop? Open Source Storage and Processing Engine
What is Apache Hadoop? The Standard Way Big Data Gets Done
• Hadoop is Flexible: structured or unstructured, schema or no schema, high volume or mere terabytes, all kinds of analytic applications
• Hadoop is Open: 100% Apache-licensed open source
• Hadoop is Scalable: proven at petabyte scale
• Benefits:
• Controls costs by storing data more affordably per terabyte than any other platform
• Drives revenue by extracting value from data that was previously out of reach
No Lock-In: investments in skills, services & hardware are preserved regardless of vendor choice
Community Development: Hadoop & related projects are expanding at a rapid pace
Rich Ecosystem: dozens of complementary software, hardware and services firms
What is Apache Hadoop? The Importance of Being Open
• Common uses of logs
• Find or count events (grep)

    grep "ERROR" file
    grep -c "ERROR" file

• Calculate metrics (performance or user behavior analysis)

    awk '{sums[$1] += $2; counts[$1] += 1} END {for (k in counts) print k, sums[k]/counts[k]}' file

• Investigate user sessions

    grep "USER" files ... | sort | less
Log Processing A Perfect Fit
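The awk one-liner above can be mirrored in a few lines of Python. This is a hedged sketch, assuming a whitespace-delimited log where the first field is a key (e.g. a URL) and the second a numeric value (e.g. response time); the function name and sample log are illustrative, not from the deck:

```python
from collections import defaultdict

def mean_by_key(lines):
    """Average the second field per first field, like the awk one-liner."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        key, value = line.split()[:2]
        sums[key] += float(value)
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in counts}

log = ["/index 100", "/index 300", "/about 50"]
print(mean_by_key(log))  # {'/index': 200.0, '/about': 50.0}
```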
• Shoot… too much data
• Homegrown parallel processing is often done on a per-file basis, because it's easy
• No parallelism within a single large file
Log Processing A Perfect Fit
(Diagram: three access_log files, each processed whole by its own task: Task 0, Task 1, Task 2.)
• MapReduce to the rescue!
• Processing is done per unit of data
Log Processing A Perfect Fit
(Diagram: a single access_log split into 64 MB blocks, 0–64 MB through 192–256 MB, handled by Task 0 through Task 3. Each task is responsible for a unit of data.)
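The per-block model can be sketched in plain Python. This is a toy illustration, not Hadoop API code: each "task" greps its own slice of the file, and a reduce step merges the partial counts, which is how a MapReduce grep-style job parallelizes one large file:

```python
from collections import Counter

def split_lines(lines, block_size):
    """Mimic HDFS splitting one large file into fixed-size blocks."""
    for i in range(0, len(lines), block_size):
        yield lines[i:i + block_size]

def map_task(split):
    """Each task greps its own unit of data: count ERROR lines."""
    return Counter(error=sum("ERROR" in line for line in split))

def reduce_counts(partials):
    """Merge the per-task partial counts."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

access_log = ["GET /a 200", "ERROR /b 500", "GET /c 200",
              "ERROR /d 500", "ERROR /e 503", "GET /f 200"]
partials = [map_task(s) for s in split_lines(access_log, 3)]
print(reduce_counts(partials)["error"])  # 3
```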
• Network or disk is the bottleneck
• Reading 100 GB of data takes:
• 14 minutes over a 1 GbE network connection
• 22 minutes from a standard disk drive
Log Processing A Perfect Fit
(Diagram: grep reading access_log; bandwidth is limited.)
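The two figures above fall out of simple arithmetic; assuming 1 GbE moves roughly 125 MB/s and a standard drive roughly 75 MB/s (the drive figure matches the throughput quoted later in the deck):

```python
GB = 1000  # MB, decimal units
data_mb = 100 * GB

nic_mb_per_s = 125    # 1 GbE = 1000 Mb/s / 8 bits per byte
disk_mb_per_s = 75    # a standard disk drive

nic_minutes = data_mb / nic_mb_per_s / 60    # ~13.3, the slide's ~14 minutes
disk_minutes = data_mb / disk_mb_per_s / 60  # ~22.2, the slide's ~22 minutes
print(round(nic_minutes), round(disk_minutes))  # 13 22
```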
• Hadoop to the rescue!
• Eliminates the network bottleneck: data is on local disk
• Data is read from many, many disks in parallel
Log Processing A Perfect Fit
(Diagram: the blocks 0–64 MB, 64–128 MB, 128–192 MB, and 192–256 MB live on different physical machines, NodeA, NodeX, NodeY, NodeZ; Task 0 through Task 3 each run on the node holding their block.)
• Hadoop currently scales to 4,000 nodes
• Goal for next release is 10,000 nodes
• Nodes typically have 12 hard drives
• A single hard drive has a throughput of about 75 MB/second
• 12 hard drives * 75 MB/second * 4,000 nodes = 3.4 TB/second
• That's bytes, not bits
• That's enough bandwidth to read 1 PB (1,000 TB) in about 5 minutes
Log Processing A Perfect Fit
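The aggregate-bandwidth figures above check out; this just replays the slide's arithmetic, using binary unit conversion for the TB/s number:

```python
drives_per_node = 12
mb_per_s_per_drive = 75
nodes = 4000

total_mb_per_s = drives_per_node * mb_per_s_per_drive * nodes  # 3,600,000 MB/s
tb_per_s = total_mb_per_s / 1024 / 1024                        # ~3.4 TB/s
pb_in_mb = 1000 * 1024 * 1024                                  # 1 PB = 1000 TB
minutes = pb_in_mb / total_mb_per_s / 60                       # ~4.9 minutes

print(round(tb_per_s, 1), round(minutes))  # 3.4 5
```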
• You have a few billion images of faces with geo-tags
• Tremendous storage problem
• Tremendous processing problem:
• Bandwidth
• Coordination
Catching 'Osama' Embarrassingly Parallel
• Store the images in Hadoop
• When processing, Hadoop will read the images from local disk, thousands of local disks spread throughout the cluster
• Use a map-only job to compare each input image against the 'needle' image
Catching 'Osama' Embarrassingly Parallel
Catching 'Osama' Embarrassingly Parallel
(Diagram: images are stored in SequenceFiles; Map Task 0 and Map Task 1 each hold a copy of the 'needle' image and output the faces 'matching' it.)
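A map-only job like this can be sketched in a few lines. Everything here is a hypothetical stand-in: `fingerprint` fakes a face matcher with a hash, and there is no reduce phase because each match is emitted independently:

```python
import hashlib

def fingerprint(image_bytes):
    """Hypothetical stand-in for a real face-recognition feature extractor."""
    return hashlib.sha256(image_bytes).hexdigest()

NEEDLE = fingerprint(b"needle-face")  # every map task gets a copy of the needle

def map_only_task(records):
    """Compare each (key, image) record against the needle; no reducer needed."""
    for key, image_bytes in records:
        if fingerprint(image_bytes) == NEEDLE:
            yield key  # output faces 'matching' the needle

images = [("img-001", b"some-face"), ("img-002", b"needle-face")]
print(list(map_only_task(images)))  # ['img-002']
```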
• One of the most common use cases I see is replacing ETL processes
• Hadoop is a huge sink of cheap storage and processing
• Aggregates are built in Hadoop and exported
• Apache Hive provides SQL-like querying over the raw data
Extract Transform Load (ETL) Everyone is doing it
Extract Transform Load (ETL) Everyone is doing it
(Diagram: the 'real-time' system, a website, writes to an online DB; an ETL process loads the data into an analytical DB / data warehouse, which feeds business intelligence applications. Much blood is shed at the ETL step.)
Extract Transform Load (ETL) Everyone is doing it
(Diagram: the same pipeline with Hadoop in the middle: data is imported from the online DB into Hadoop, transformed there, and exported to the analytical DB / data warehouse feeding the BI applications.)
Extract Transform Load (ETL) Everyone is doing it
(Diagram: the same pipeline again, with Apache Sqoop handling both the import from the online DB into Hadoop and the export into the analytical DB / data warehouse.)
• Analytics is often simply counting things
• Facebook chose HBase to store its massive counter infrastructure (more later)
• How might one implement a counter infrastructure in HBase?
Analytics in HBase Scaling Writes
Analytics in HBase Scaling Writes

Individual page counters:

    URL                          Counter
    com.cloudera/blog/…          154
    com.cloudera/downloads/…     923621
    com.cloudera/resources/…     2138

User & content-type counters:

    User               Content    Counter
    [email protected]      NEWS       5431
    [email protected]      TECH       79310
    [email protected]      SHOPPING   59
    [email protected]      SPORTS     94214

A 'Like' button IMG request sends an HTTP request to Facebook servers, which increments several counters.
Analytics in HBase Scaling Writes

Individual page counters:

    URL                          Counter
    com.cloudera/blog/…          154
    com.cloudera/downloads/…     923621
    com.cloudera/resources/…     2138

The host is reversed in the URL as part of the row key:
• Data is physically stored in sorted order
• Scanning all 'com.cloudera' counters results in sequential I/O
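The reversed-host key can be sketched in a few lines. This is an illustration of the row-key design, not HBase API code, and the helper name is made up: reversing the hostname makes all keys for one domain lexicographically adjacent, so a scan over them is sequential I/O:

```python
def row_key(url):
    """Build an HBase-style row key with the host reversed, so that sorting
    groups all pages of one domain together."""
    host, _, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + "/" + path

urls = ["www.cloudera.com/blog", "archive.org/web", "blog.cloudera.com/hbase"]
for key in sorted(row_key(u) for u in urls):
    print(key)
# the two cloudera.com keys sort adjacently, ahead of org.archive/web
```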
• Real-time counters of URLs shared, links "liked", impressions generated
• 20 billion events/day (200K events/sec)
• ~30-second latency from click to count
• Heavy use of the incrementColumnValue API for consistent counters
• Tried MySQL and Cassandra, settled on HBase: http://tiny.cloudera.com/hbase-�-analytics
Facebook Analytics Scaling Writes
Machine Learning Apache Mahout
Text Clustering on Google News
Machine Learning Apache Mahout
Collaborative Filtering on Amazon
Machine Learning Apache Mahout
Classification in Gmail
Machine Learning Apache Mahout
• Apache Mahout implements:
• Collaborative filtering
• Classification
• Clustering
• Frequent itemset mining
• More coming with the integration of MapReduce.Next
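Mahout itself is Java and runs on Hadoop, but the core idea of collaborative filtering fits in a few lines. This is a toy, in-memory sketch (all names and data are illustrative): recommend the items liked by the most similar other user, with similarity measured as cosine similarity over item sets, which Mahout computes at scale:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sets of liked items."""
    inter = len(a & b)
    return inter / (sqrt(len(a)) * sqrt(len(b))) if a and b else 0.0

def recommend(target, ratings):
    """Suggest items from the most similar other user that target lacks."""
    others = {u: items for u, items in ratings.items() if u != target}
    best = max(others, key=lambda u: cosine(ratings[target], others[u]))
    return sorted(others[best] - ratings[target])

ratings = {
    "alice": {"hadoop", "hbase", "hive"},
    "bob": {"hadoop", "hbase", "pig"},
    "carol": {"knitting"},
}
print(recommend("alice", ratings))  # ['pig']
```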
• Other use cases:
• OpenTSDB, an open, distributed, scalable Time Series Database (TSDB)
• Building search indexes (the canonical use case)
• Facebook Messaging
• Cheap and deep storage, e.g. archiving emails for SOX compliance
• Audit logging
• Non-use cases:
• Data processing that can be handled by one beefy server
• Data that requires transactions
Final Thoughts Use the right tool
• Brock Noland
• http://twitter.com/brocknoland
• TC-HUG: http://tch.ug
About the Presenter