10 concepts the enterprise decision maker needs to understand about Hadoop
![Page 1: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/1.jpg)
10 concepts the enterprise decision maker needs to understand about Hadoop
Donald Miner
Strata + Hadoop World 2016 – San Jose
March 31st, 2016
![Page 3: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/3.jpg)
Purpose of this talk
An honest and minimal introduction to Hadoop
Why is Hadoop popular?
What does Hadoop do well and why?
What is bad about Hadoop?
![Page 4: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/4.jpg)
#1 - Hadoop masks being a distributed system
![Page 5: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/5.jpg)
#1 - Hadoop masks being a distributed system
// This block of code defines the behavior of the map phase
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  // Split the line of text into words
  StringTokenizer itr = new StringTokenizer(value.toString());
  // Go through each word and send it
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    // "I've seen this word once!"
    context.write(word, one);
  }
}
[1]$ hadoop fs -put hamlet.txt datz/hamlet.txt
[2]$ hadoop fs -put macbeth.txt data/macbeth.txt
[3]$ hadoop fs -mv datz/hamlet.txt data/hamlet.txt
[4]$ hadoop fs -ls data/
-rw-r--r-- 1 don don 139k 2012-01-31 23:49 /user/don/data/caesar.txt
-rw-r--r-- 1 don don 180k 2013-09-25 20:45 /user/don/data/hamlet.txt
-rw-r--r-- 1 don don 117k 2013-09-25 20:46 /user/don/data/macbeth.txt
![Page 6: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/6.jpg)
#1 - Hadoop masks being a distributed system
Why is this so important?
What does it not do for me?
![Page 7: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/7.jpg)
#2 - Hadoop scales out linearly
The amount of data, the amount of time something takes, and the amount of hardware you have are linearly linked¹
¹ usually
![Page 8: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/8.jpg)
#2 - Hadoop scales out linearly
Double the compute, half the time!
![Page 9: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/9.jpg)
#2 - Hadoop scales out linearly
Double the data, twice the time!
![Page 10: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/10.jpg)
#2 - Hadoop scales out linearly
Double the data, double the compute, the same time!
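The doubling rules on these slides all follow from one proportionality: run time grows with the data and shrinks with the compute. A minimal sketch, assuming ideal linear scaling (exactly the part the "usually" footnote warns about); the class name, method name, and the 60-minute baseline are illustrative and not from the talk.

```java
// Idealized model of linear scale-out: time is proportional to data / nodes.
// It ignores fixed job overhead, skew, and shuffle cost (the "usually" caveat).
final class LinearScaling {

    /**
     * Estimates run time after changing data volume and cluster size.
     *
     * @param baselineMinutes measured run time of the baseline job
     * @param dataFactor      new data size divided by baseline data size
     * @param computeFactor   new node count divided by baseline node count
     */
    static double estimateMinutes(double baselineMinutes, double dataFactor, double computeFactor) {
        if (dataFactor <= 0 || computeFactor <= 0) {
            throw new IllegalArgumentException("factors must be positive");
        }
        return baselineMinutes * dataFactor / computeFactor;
    }

    public static void main(String[] args) {
        double baseline = 60.0; // hypothetical: a job that takes one hour today
        System.out.println("Double the compute: " + estimateMinutes(baseline, 1, 2) + " min");
        System.out.println("Double the data:    " + estimateMinutes(baseline, 2, 1) + " min");
        System.out.println("Double both:        " + estimateMinutes(baseline, 2, 2) + " min");
    }
}
```

For capacity planning, a model like this is only a starting point; measured run times on a real cluster should replace the baseline and confirm the slope.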
![Page 11: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/11.jpg)
#2 - Hadoop scales out linearly
Data locality!
![Page 12: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/12.jpg)
#2 - Hadoop scales out linearly
Why is this so important?
What does it not do for me?
![Page 13: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/13.jpg)
#3 - Hadoop runs on commodity hardware
![Page 14: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/14.jpg)
#3 - Hadoop runs on commodity hardware
• Non-proprietary
• Easy to acquire (all it takes is $)
• Value (not necessarily cheap)
• Let software handle the hard problems
![Page 15: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/15.jpg)
#3 - Hadoop runs on commodity hardware
Why is this so important?
What does it not do for me?
![Page 16: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/16.jpg)
#4 - Hadoop handles unstructured data
Query languages like SQL assume some sort of structure
Relational databases and other databases require structure
MapReduce/Spark is just Java/Scala/Python/etc.
You can do anything Java can do
HDFS just stores files
You can store anything in a file
![Page 17: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/17.jpg)
#4 - Hadoop handles unstructured data
Why is this so important?
What does it not do for me?
![Page 18: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/18.jpg)
#5 - In Hadoop, you load data first and ask questions later
BEFORE:
ETL
Years of planning
Schemas & ER Diagrams
![Page 19: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/19.jpg)
#5 - In Hadoop, you load data first and ask questions later
WITH HADOOP:
LOAD DATA FIRST, ASK QUESTIONS LATER
Data is parsed/interpreted as it is loaded out of HDFS
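One way to picture "parsed as it is loaded": the raw lines are stored untouched, and each question brings its own parser. A minimal sketch in plain Java, with an in-memory list standing in for a file in HDFS; the log format, field positions, and all names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Schema-on-read sketch: raw text is kept as-is and interpreted only when queried.
// The list passed in stands in for a file in HDFS; the log format is made up.
final class SchemaOnRead {

    // Question 1: how many requests per HTTP status? Reads only the third field.
    static Map<String, Integer> countByStatus(List<String> rawLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : rawLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 3) {
                continue; // malformed line: skipped at read time, never rejected at load time
            }
            counts.merge(fields[2], 1, Integer::sum);
        }
        return counts;
    }

    // Question 2, asked later of the same bytes: how many distinct paths were requested?
    static long distinctPaths(List<String> rawLines) {
        return rawLines.stream()
                .map(line -> line.trim().split("\\s+"))
                .filter(fields -> fields.length >= 2)
                .map(fields -> fields[1])
                .distinct()
                .count();
    }

    public static void main(String[] args) {
        List<String> raw = List.of(
                "2016-03-31 /index.html 200",
                "2016-03-31 /missing 404",
                "garbage",
                "2016-03-31 /index.html 200");
        System.out.println(countByStatus(raw));
        System.out.println(distinctPaths(raw));
    }
}
```

The trade-off is visible in the `continue` branch: nothing validated the data on the way in, so every reader has to decide what to do with lines that do not fit.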
![Page 20: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/20.jpg)
#5 - In Hadoop, you load data first and ask questions later
![Page 21: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/21.jpg)
#5 - In Hadoop, you load data first and ask questions later
Why is this so important?
What does it not do for me?
![Page 22: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/22.jpg)
#6 - HDFS stores the data but has some major limitations
• Stores files in folders
• Nobody cares what’s in your files
• Chunks large files into blocks (~64MB-2GB)
• 3 replicas of each block
• Blocks are scattered all over the place
• Can scale to thousands of nodes and hundreds of petabytes
[Diagram: a FILE divided into BLOCKS]
![Page 23: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/23.jpg)
#6 - HDFS stores the data but has some major limitations
Limitations:
• Low IOPS
• Higher latency
• Can’t edit files
• Can’t handle small files
• Low storage efficiency (33%)
• Low throughput on single files
But…
• High aggregate throughput
• Massive scale
• Software only
• Few bottlenecks
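The block, replica, and storage-efficiency figures reduce to simple arithmetic. A minimal sketch, assuming a 128 MiB block size (one value inside the slide's ~64MB-2GB range, picked here for illustration) and the 3 replicas the slide mentions; the class and method names are hypothetical.

```java
// Back-of-the-envelope HDFS storage math. Block size and file sizes are
// illustrative assumptions; the replication factor of 3 comes from the slide.
final class HdfsBlockMath {

    static final long MIB = 1024L * 1024L;

    // Number of blocks a file occupies; the last block may be only partly filled.
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("file size must be >= 0 and block size > 0");
        }
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    // Raw disk consumed across the cluster once every block is replicated.
    static long rawBytes(long fileBytes, int replicas) {
        return fileBytes * replicas;
    }

    // Fraction of raw disk that holds unique data, e.g. 1/3 for 3 replicas.
    static double storageEfficiency(int replicas) {
        return 1.0 / replicas;
    }

    public static void main(String[] args) {
        long blockSize = 128 * MIB; // assumed block size
        long oneGiB = 1024 * MIB;
        System.out.println("Blocks for a 1 GiB file: " + blockCount(oneGiB, blockSize));
        System.out.println("Raw MiB used:            " + rawBytes(oneGiB, 3) / MIB);
        System.out.printf("Storage efficiency:      %.0f%%%n", 100 * storageEfficiency(3));
        // A tiny file still needs its own block entry in the NameNode's metadata,
        // which is one reason many small files are a poor fit.
        System.out.println("Blocks for a 1 KiB file: " + blockCount(1024, blockSize));
    }
}
```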
![Page 24: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/24.jpg)
#6 - HDFS stores the data but has some major limitations
Why is this so important?
What does it not do for me?
![Page 25: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/25.jpg)
#7 - YARN controls everything going on and is mostly behind the scenes
• Controls the compute resources on the cluster
• Was the key new feature in Hadoop 2.0
• Abstracted resource management from MapReduce to be more general
• MapReduce became just another application
• YARN is key in enabling multiple compute engines at once
![Page 26: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/26.jpg)
#7 - YARN controls everything going on and is mostly behind the scenes
Why is this so important?
What does it not do for me?
![Page 27: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/27.jpg)
#8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too)
• Analyzes raw data in HDFS where the data is
• Jobs are split into Mappers and Reducers
Mappers (you code this)
• Loads data from HDFS
• Filter, transform, parse
• Outputs (key, value) pairs
Reducers (you code this, too)
• Automatically groups by the mapper’s output key
• Aggregate, count, statistics
• Outputs to HDFS
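The mapper/reducer split can be imitated on one machine. A minimal sketch in plain Java, not the Hadoop API: `map` emits (word, 1) pairs, a sorted map stands in for the framework's automatic group-by-key, and `reduce` sums each group. All names are hypothetical and nothing here is distributed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Single-process imitation of the MapReduce flow for word counting.
// This is NOT the Hadoop API; it only mirrors the three stages conceptually.
final class LocalWordCount {

    // Map stage (user code): one input line in, zero or more (key, value) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(Map.entry(itr.nextToken(), 1)); // "I've seen this word once!"
        }
        return pairs;
    }

    // Shuffle stage (done by the framework in Hadoop): group values by key.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    // Reduce stage (user code): all values for one key in, one aggregate out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three stages in order over the given lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        groupByKey(mapped).forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or not to be", "to thine own self be true")));
    }
}
```

In a real cluster the `map` calls run on many machines next to the data, and the grouping step moves data across the network; only the two user-written stages keep the shape shown here.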
![Page 28: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/28.jpg)
#8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too)
“MapReduce is slow”
“MapReduce is hard to use”
![Page 29: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/29.jpg)
#8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too)
[Diagram labels: Real-time, Large-scale analytics, Ad-hoc]
[Engines shown: MapReduce!]
![Page 30: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/30.jpg)
#8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too)
[Diagram labels: Real-time, Large-scale analytics, Ad-hoc]
[Engines shown: MapReduce!, Storm/streaming]
![Page 31: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/31.jpg)
#8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too)
[Diagram labels: Real-time, Large-scale analytics, Ad-hoc]
[Engines shown: MapReduce!, Storm/streaming, Impala/HAWQ/Stinger]
![Page 32: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/32.jpg)
#8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too)
[Diagram labels: Real-time, Large-scale analytics, Ad-hoc]
[Engines shown: MapReduce!, Storm/streaming, Impala/HAWQ/Stinger, Spark]
![Page 33: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/33.jpg)
#8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too)
[Diagram labels: Real-time, Large-scale analytics, Ad-hoc]
[Engines shown: MapReduce!, Storm/streaming, Spark]
![Page 34: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/34.jpg)
#8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too)
[Diagram labels: Real-time, Large-scale analytics, Ad-hoc]
[Engines shown: MapReduce!, Spark]
Not everyone has this problem, but it’s a really interesting problem!
![Page 35: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/35.jpg)
#8 - MapReduce may be getting a bad rap, but it’s still really important
Why is this so important?
What does it not do for me?
![Page 36: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/36.jpg)
#9 - Hadoop is open source
Free – money isn’t just a financial barrier, but a bureaucratic one, too
Help yourself – Hadoop is a complex system underneath and sometimes you need to figure something out for yourself
Adoption – it’s easier to adopt, so adoption is more widespread
Expansion – can be extended by anyone
![Page 37: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/37.jpg)
#9 - Hadoop is open source
Why is this so important?
What does it not do for me?
![Page 38: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/38.jpg)
#10 - The Hadoop ecosystem is constantly growing and evolving
Not only do individual Hadoop components improve…
But Hadoop overall improves with new components that do new things differently.
And they piece together into something that gets a lot of work done.
![Page 39: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/39.jpg)
#10 - The Hadoop ecosystem is constantly growing and evolving
Why is this so important?
What does it not do for me?
![Page 40: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/40.jpg)
Play by Hadoop’s rules and it’ll give you what you want
![Page 41: 10 concepts the enterprise decision maker needs to understand about Hadoop](https://reader038.vdocuments.net/reader038/viewer/2022102322/586f77561a28ab10258b6793/html5/thumbnails/41.jpg)
@donaldpminer