2014 feb 5_what_ishadoop_mda
DESCRIPTION
What Is Hadoop, updated to include YARN, Tez, Spark, and Storm
TRANSCRIPT
WELCOME TO HADOOP
Adam Muise – Hortonworks
Who am I?
Why are we here?
Data
“Big Data” is the marketing term of the decade.
What lurks behind the hype is the democratization of Data.
You need to deal with Data.
You’re probably not as good at that as you think.
Put it away, delete it, tweet it, compress it, shred it, wikileak it, put it in a database, put it in SAN/NAS, put it in the cloud, hide it in tape…
Let’s talk challenges…
Volume
[Slide visual: the word “Volume” repeated over and over, filling slide after slide]
Storage, Management, and Processing all become challenges with Data at Volume.
Traditional technologies adopt a divide, drop, and conquer approach.
The solution?
[Slide visual: data silos labeled EDW, Yet Another EDW, Analytical DB, OLTP, and Another EDW, each packed with Data]
Ummm…you dropped something
[Slide visual: the same silos (EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW) with Data spilling out around them]
Analyzing the data usually raises more interesting questions…
…which leads to more data.
Wait, you’ve seen this before.
[Slide visual: a “Sausage Factory” taking in Data and churning out still more Data]
Data begets Data.
What keeps us from Data?
“Prices, Stupid passwords, and Boring Statistics.” – Hans Rosling
http://www.youtube.com/watch?v=hVimVzgtD6w
Your data silos are lonely places.
[Slide visual: isolated silos (EDW, Accounts, Customers, Web Properties), each holding its own Data]
… Data likes to be together.
[Slide visual: the same silos (EDW, Accounts, Customers, Web Properties) drawn together]
Data likes to socialize too.
[Slide visual: the silos joined by new sources: Machine Data, Twitter, CDR, and Weather Data]
New types of data don’t quite fit into your pristine view of the world.
My Little Data Empire
[Slide visual: a tidy Data empire confronted by Logs and Machine Data that don’t fit, marked with question marks]
To resolve this, some people take hints from Lord Of The Rings…
…and create One-Schema-To-Rule-Them-All…
[Slide visual: an EDW imposing a single Schema on all of its Data]
…but that has its problems too.
[Slide visual: the single-schema EDW ringed by ETL jobs forcing Data into the Schema]
So what is the answer?
Enter the Hadoop.
http://www.fabulouslybroke.com/2011/05/ninja-elephants-and-other-awesome-stories/
………
Hadoop was created because Big IT never cut it for Internet properties like Google, Yahoo, Facebook, Twitter, and LinkedIn.
Traditional architecture didn’t scale enough…
[Slide visual: traditional three-tier stacks (App servers over DBs over SANs) multiplied again and again]
Databases become bloated and useless
Traditional architectures cost too much at that volume…
$/TB
$pecial Hardware
$upercomputing
How would you fix this?
If you could design a system that would handle this, what would it look like?
It would probably need a highly resilient, self-healing, cost-efficient, distributed file system…
[Slide visual: a grid of Storage nodes]
It would probably need a completely parallel processing framework that took tasks to the data…
[Slide visual: Processing placed alongside Storage on every node]
It would probably run on commodity hardware, virtualized machines, and common OS platforms.
[Slide visual: the same Storage-plus-Processing grid on commodity machines]
It would probably be open source so innovation could happen as quickly as possible.
It would need a critical mass of users.
It would be Apache Hadoop.
{Processing + Storage} = {MapReduce/Tez/YARN + HDFS}
HDFS stores data in blocks and replicates those blocks
[Slide visual: block1, block2, and block3 each stored three times across the Storage/Processing nodes]
If a block fails then HDFS always has the other copies and heals itself
[Slide visual: one copy of a block lost to a failed node; HDFS still has the other copies and re-replicates]
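The replicate-and-heal behavior can be sketched in miniature. This is a toy Python model, not HDFS code; the `Cluster` class and its placement rules are invented for illustration (only the 3x replication factor matches HDFS's default):

```python
# Toy model of HDFS-style block replication and self-healing.
# Illustrative only: Cluster and its placement policy are invented.
REPLICATION = 3

class Cluster:
    def __init__(self, nodes):
        # node name -> set of block ids stored there
        self.nodes = {n: set() for n in nodes}

    def write_block(self, block):
        # Place one replica on each of the REPLICATION least-loaded nodes.
        targets = sorted(self.nodes, key=lambda n: len(self.nodes[n]))[:REPLICATION]
        for n in targets:
            self.nodes[n].add(block)

    def replicas(self, block):
        return [n for n, blocks in self.nodes.items() if block in blocks]

    def fail_node(self, dead):
        # Lose the node, then re-replicate any under-replicated blocks.
        lost = self.nodes.pop(dead)
        for block in lost:
            while len(self.replicas(block)) < REPLICATION and \
                  len(self.replicas(block)) < len(self.nodes):
                spare = min((n for n in self.nodes if block not in self.nodes[n]),
                            key=lambda n: len(self.nodes[n]))
                self.nodes[spare].add(block)

cluster = Cluster(["node1", "node2", "node3", "node4"])
for b in ["block1", "block2", "block3"]:
    cluster.write_block(b)

cluster.fail_node("node1")
# Every block is back to 3 replicas on the surviving nodes.
assert all(len(cluster.replicas(b)) == 3 for b in ["block1", "block2", "block3"])
```

The point of the sketch: no replica is ever the only copy, so losing a node is an event to recover from, not a disaster.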
MapReduce is a programming paradigm that is completely parallel.
[Slide visual: five Mappers reading Data in parallel and feeding three Reducers]
MapReduce has three phases: Map, Sort/Shuffle, Reduce
[Slide visual: Mappers emitting Key, Value pairs that are sorted and shuffled to Reducers]
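The three phases can be made concrete with the classic word count, written as plain Python rather than the Hadoop API, with Map, Sort/Shuffle, and Reduce as explicit steps:

```python
# Word count in the MapReduce style: Map, Sort/Shuffle, Reduce.
# Plain Python illustrating the phases, not the Hadoop API.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map: emit (key, value) pairs -- here (word, 1).
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Sort/Shuffle: bring all values for the same key together.
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reducer(key, values):
    # Reduce: aggregate the values for one key.
    return (key, sum(v for _, v in values))

lines = ["big data big hadoop", "hadoop big"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = dict(reducer(k, vs) for k, vs in shuffle(pairs))
print(counts)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

Because each mapper sees only its own lines and each reducer only its own key, every phase parallelizes across machines.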
MapReduce applies to a lot of data processing problems
[Slide visual: the same Mapper/Reducer flow applied to Data of many kinds]
MapReduce goes a long way, but not all data processing and analytics are solved the same way.
Sometimes your data application needs parallel processing and inter-process communication…
[Slide visual: Processes exchanging messages with one another while working over Data]
…like Complex Event Processing in Apache Storm
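The idea of stateful steps handing events to one another can be sketched in plain Python. This is not the Storm API; `WindowCounter` and `Alerter` are invented stand-ins for the kind of chained, stateful processing Storm topologies do:

```python
# Toy sketch of complex event processing: a stream of events flows through
# a chain of stateful steps. Illustrative only, not the Apache Storm API.
from collections import deque, Counter

class WindowCounter:
    """Counts events per type over the last `size` events (a sliding window)."""
    def __init__(self, size):
        self.window = deque(maxlen=size)

    def process(self, event):
        self.window.append(event)
        return Counter(self.window)

class Alerter:
    """Downstream step: flags an event type that dominates the window."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.alerts = []

    def process(self, counts):
        for event_type, n in counts.items():
            if n >= self.threshold:
                self.alerts.append(event_type)

counter = WindowCounter(size=4)
alerter = Alerter(threshold=3)
for event in ["login", "error", "error", "error", "login"]:
    alerter.process(counter.process(event))

print(alerter.alerts)  # ['error', 'error']
```

Unlike a MapReduce job, this pipeline never "finishes": state lives in the processes and each new event updates it immediately.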
Sometimes your machine learning data application needs to process in memory and iterate…
[Slide visual: Processes iterating over Data held in memory]
…like in Machine Learning in Spark
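Why holding data in memory matters for iteration can be shown with a toy gradient descent that rescans the same cached dataset every pass. This is plain Python, not the Spark API; the dataset and learning rate are invented for the sketch:

```python
# Toy sketch of iterative machine learning: gradient descent re-reads the
# same dataset on every iteration, so keeping it in memory (rather than
# re-reading from disk per pass, as chained MapReduce jobs would) pays off.
# Illustrative only, not the Spark API.

# Dataset "cached" in memory once; the true relationship is y = 2 * x.
data = [(x, 2.0 * x) for x in range(1, 6)]

w = 0.0      # model parameter being learned
lr = 0.01    # learning rate
for _ in range(200):  # each iteration scans the cached data again
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 3))  # converges toward 2.0
```

Two hundred passes over an in-memory list are instant; two hundred disk scans launched as separate batch jobs are not, which is the gap Spark targets.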
Introducing YARN
YARN = Yet Another Resource Negotiator
YARN abstracts resource management so you can run more than just MapReduce.
[Slide visual: HDFS2 at the base, YARN above it, and on top MapReduce V2, Storm, MPI, Giraph, HBase, Tez, Spark … and more]
Resource Manager + Node Managers = YARN
[Slide visual: a Resource Manager with its Scheduler, plus AppMasters and Node Managers hosting Containers for MapReduce, Storm, and Pig]
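The negotiation between applications and the Resource Manager can be caricatured in a few lines. This is a toy model, not the YARN protocol, which negotiates memory and vcores and handles locality, queues, and failures; the `ResourceManager` class here is invented for the sketch:

```python
# Toy model of YARN-style resource negotiation: applications ask the
# ResourceManager for containers, and a simple scheduler grants them from
# whichever NodeManager has the most free capacity.
# Illustrative only, not the real YARN API.
class ResourceManager:
    def __init__(self, node_capacity):
        # node name -> number of free container slots
        self.free = dict(node_capacity)

    def request_container(self, app):
        # Grant a container on the node with the most free slots.
        node = max(self.free, key=self.free.get)
        if self.free[node] == 0:
            return None  # cluster full: the app must wait
        self.free[node] -= 1
        return (app, node)

rm = ResourceManager({"node1": 2, "node2": 1})
granted = [rm.request_container(app)
           for app in ["MapReduce", "Storm", "Pig", "Spark"]]
print(granted)
# [('MapReduce', 'node1'), ('Storm', 'node1'), ('Pig', 'node2'), None]
```

The key idea survives the caricature: the scheduler neither knows nor cares what the containers run, which is exactly why MapReduce, Storm, and Pig can share one cluster.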
YARN turns Hadoop into a smart phone: An App Ecosystem
hortonworks.com/yarn/
Check out the book too…
Preview at: hortonworks.com/yarn/
YARN is an essential part of a balanced breakfast in Hadoop 2.x
Introducing Tez
Tez is a YARN application, like MapReduce is a YARN application
Tez is the Lego set for your data applica=on
Tez provides a layer for abstract tasks; these could be mappers, reducers, customized stream processes, in-memory structures, etc.
Tez can chain tasks together into one job to get Map – Reduce – Reduce jobs suitable for things like Hive SQL projections, group by, and order by.
[Slide visual: TezMaps feeding a chain of TezReduces within a single job]
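The Map – Reduce – Reduce chaining can be illustrated with a group-by followed by an order-by, written in plain Python as a stand-in for what Tez runs as one job (this is not the Tez API; the sales data is invented):

```python
# Sketch of the Map -> Reduce -> Reduce chaining Tez enables: a group-by
# (first reduce) flows straight into an order-by (second reduce) inside
# one job, with no intermediate job boundary between them.
# Plain Python for illustration, not the Tez API.
from itertools import groupby
from operator import itemgetter

sales = [("east", 10), ("west", 5), ("east", 7), ("north", 20), ("west", 1)]

# Map: emit (region, amount) -- here the records already are key/value pairs.
mapped = [(region, amount) for region, amount in sales]

# Reduce 1 (GROUP BY region): sum amounts per region.
grouped = [(region, sum(a for _, a in rows))
           for region, rows in groupby(sorted(mapped, key=itemgetter(0)),
                                       key=itemgetter(0))]

# Reduce 2 (ORDER BY total DESC): globally order the grouped totals.
ordered = sorted(grouped, key=itemgetter(1), reverse=True)

print(ordered)  # [('north', 20), ('east', 17), ('west', 6)]
```

With classic MapReduce this query would be two jobs with an HDFS write in between; the chained form is one pipeline end to end.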
Tez can provide long-running containers for applications like Hive to side-step the batch-process startups you would have with MapReduce.
Hadoop has other open source projects…
Hive = {SQL -> Tez || MapReduce} SQL-IN-HADOOP
Pig = {PigLatin -> Tez || MapReduce}
HCatalog = {metadata* for MapReduce, Hive, Pig, HBase}
*metadata = tables, columns, partitions, types
Oozie = Job::{Task, Task, if Task, then Task, final Task}
Falcon
[Slide visual: Feeds replicated between Hadoop clusters for DR]
Knox
[Slide visual: REST Clients reaching Hadoop Clusters through the Knox Gateway, authenticated against Enterprise LDAP]
Flume
[Slide visual: Flume agents funneling JMS messages, weblogs, events, and files into Hadoop]
Sqoop
[Slide visual: Sqoop moving data between databases and Hadoop]
Ambari = {install, manage, monitor}
HBase = {real-time, distributed-map, big-tables}
Storm = {Complex Event Processing, Near-Real-Time, Provisioned by YARN}
Apache Hadoop
[Slide visual: the component stack: Flume, Ambari, HBase, Falcon, MapReduce, HDFS, Sqoop, HCatalog, Pig, Hive, Storm, YARN, Knox, Tez]
Hortonworks Data Platform
[Slide visual: the same components packaged in the Hortonworks Data Platform]
What else are we working on?
hortonworks.com/labs/
Hadoop is the new Modern Data Architecture