2014 feb 5_what_ishadoop_mda
DESCRIPTION
What Is Hadoop, updated to include YARN, Tez, Spark, and Storm
TRANSCRIPT
WELCOME TO HADOOP
Adam Muise – Hortonworks
Who am I?
Why are we here?
Data
“Big Data” is the marketing term of the decade.
What lurks behind the hype is the democratization of Data.
You need to deal with Data.
You’re probably not as good at that as you think.
Put it away, delete it, tweet it, compress it, shred it, wikileak it, put it in a database, put it in SAN/NAS, put it in the cloud, hide it in tape…
Let’s talk challenges…
Volume
[Slide visual: the word “Volume” repeated over and over, filling slide after slide]
Storage, Management, and Processing all become challenges with Data at Volume.
Traditional technologies adopt a divide, drop, and conquer approach.
The solution?
[Slide visual: data silos labeled EDW, Yet Another EDW, Analytical DB, OLTP, and Another EDW, each packed with Data]
Ummm…you dropped something
[Slide visual: the same silos (EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW) with Data spilling out around them]
Analyzing the data usually raises more interesting questions…
…which leads to more data.
Wait, you’ve seen this before.
[Slide visual: a “Sausage Factory” taking in Data and churning out still more Data]
Data begets Data.
What keeps us from Data?
“Prices, Stupid passwords, and Boring Statistics.” – Hans Rosling
http://www.youtube.com/watch?v=hVimVzgtD6w
Your data silos are lonely places.
[Slide visual: isolated silos (EDW, Accounts, Customers, Web Properties), each holding its own Data]
… Data likes to be together.
[Slide visual: the same silos (EDW, Accounts, Customers, Web Properties) drawn together]
Data likes to socialize too.
[Slide visual: the silos joined by new sources: Machine Data, Twitter, CDR, and Weather Data]
New types of data don’t quite fit into your pristine view of the world.
My Little Data Empire
[Slide visual: a tidy Data empire confronted by Logs and Machine Data that don’t fit, marked with question marks]
To resolve this, some people take hints from Lord Of The Rings…
…and create One-Schema-To-Rule-Them-All…
[Slide visual: an EDW imposing a single Schema on all of its Data]
…but that has its problems too.
[Slide visual: the single-schema EDW ringed by ETL jobs forcing Data into the Schema]
So what is the answer?
Enter the Hadoop.
http://www.fabulouslybroke.com/2011/05/ninja-elephants-and-other-awesome-stories/
………
Hadoop was created because Big IT never cut it for Internet properties like Google, Yahoo, Facebook, Twitter, and LinkedIn.
Traditional architecture didn’t scale enough…
[Slide visual: traditional three-tier stacks (App servers over DBs over SANs) multiplied again and again]
Databases become bloated and useless
Traditional architectures cost too much at that volume…
$/TB
$pecial Hardware
$upercomputing
How would you fix this?
If you could design a system that would handle this, what would it look like?
It would probably need a highly resilient, self-healing, cost-efficient, distributed file system…
[Slide visual: a grid of Storage nodes]
It would probably need a completely parallel processing framework that took tasks to the data…
[Slide visual: Processing placed alongside Storage on every node]
It would probably run on commodity hardware, virtualized machines, and common OS platforms.
[Slide visual: the same Storage-plus-Processing grid on commodity machines]
It would probably be open source so innovation could happen as quickly as possible.
It would need a critical mass of users.
It would be Apache Hadoop.
{Processing + Storage} = {MapReduce/Tez/YARN + HDFS}
HDFS stores data in blocks and replicates those blocks
[Slide visual: block1, block2, and block3 each stored three times across the Storage/Processing nodes]
If a block fails then HDFS always has the other copies and heals itself
[Slide visual: one copy of a block lost to a failed node; HDFS still has the other copies and re-replicates]
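The replicate-and-heal behavior can be sketched in miniature. This is a toy Python model, not HDFS code; the `Cluster` class and its placement rules are invented for illustration (only the 3x replication factor matches HDFS's default):

```python
# Toy model of HDFS-style block replication and self-healing.
# Illustrative only: Cluster and its placement policy are invented.
REPLICATION = 3

class Cluster:
    def __init__(self, nodes):
        # node name -> set of block ids stored there
        self.nodes = {n: set() for n in nodes}

    def write_block(self, block):
        # Place one replica on each of the REPLICATION least-loaded nodes.
        targets = sorted(self.nodes, key=lambda n: len(self.nodes[n]))[:REPLICATION]
        for n in targets:
            self.nodes[n].add(block)

    def replicas(self, block):
        return [n for n, blocks in self.nodes.items() if block in blocks]

    def fail_node(self, dead):
        # Lose the node, then re-replicate any under-replicated blocks.
        lost = self.nodes.pop(dead)
        for block in lost:
            while len(self.replicas(block)) < REPLICATION and \
                  len(self.replicas(block)) < len(self.nodes):
                spare = min((n for n in self.nodes if block not in self.nodes[n]),
                            key=lambda n: len(self.nodes[n]))
                self.nodes[spare].add(block)

cluster = Cluster(["node1", "node2", "node3", "node4"])
for b in ["block1", "block2", "block3"]:
    cluster.write_block(b)

cluster.fail_node("node1")
# Every block is back to 3 replicas on the surviving nodes.
assert all(len(cluster.replicas(b)) == 3 for b in ["block1", "block2", "block3"])
```

The point of the sketch: no replica is ever the only copy, so losing a node is an event to recover from, not a disaster.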
MapReduce is a programming paradigm that is completely parallel.
[Slide visual: five Mappers reading Data in parallel and feeding three Reducers]
MapReduce has three phases: Map, Sort/Shuffle, Reduce
[Slide visual: Mappers emitting Key, Value pairs that are sorted and shuffled to Reducers]
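The three phases can be made concrete with the classic word count, written as plain Python rather than the Hadoop API, with Map, Sort/Shuffle, and Reduce as explicit steps:

```python
# Word count in the MapReduce style: Map, Sort/Shuffle, Reduce.
# Plain Python illustrating the phases, not the Hadoop API.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map: emit (key, value) pairs -- here (word, 1).
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Sort/Shuffle: bring all values for the same key together.
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reducer(key, values):
    # Reduce: aggregate the values for one key.
    return (key, sum(v for _, v in values))

lines = ["big data big hadoop", "hadoop big"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = dict(reducer(k, vs) for k, vs in shuffle(pairs))
print(counts)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

Because each mapper sees only its own lines and each reducer only its own key, every phase parallelizes across machines.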
MapReduce applies to a lot of data processing problems
[Slide visual: the same Mapper/Reducer flow applied to Data of many kinds]
MapReduce goes a long way, but not all data processing and analytics are solved the same way.
Sometimes your data application needs parallel processing and inter-process communication…
[Slide visual: Processes exchanging messages with one another while working over Data]
…like Complex Event Processing in Apache Storm
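The idea of stateful steps handing events to one another can be sketched in plain Python. This is not the Storm API; `WindowCounter` and `Alerter` are invented stand-ins for the kind of chained, stateful processing Storm topologies do:

```python
# Toy sketch of complex event processing: a stream of events flows through
# a chain of stateful steps. Illustrative only, not the Apache Storm API.
from collections import deque, Counter

class WindowCounter:
    """Counts events per type over the last `size` events (a sliding window)."""
    def __init__(self, size):
        self.window = deque(maxlen=size)

    def process(self, event):
        self.window.append(event)
        return Counter(self.window)

class Alerter:
    """Downstream step: flags an event type that dominates the window."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.alerts = []

    def process(self, counts):
        for event_type, n in counts.items():
            if n >= self.threshold:
                self.alerts.append(event_type)

counter = WindowCounter(size=4)
alerter = Alerter(threshold=3)
for event in ["login", "error", "error", "error", "login"]:
    alerter.process(counter.process(event))

print(alerter.alerts)  # ['error', 'error']
```

Unlike a MapReduce job, this pipeline never "finishes": state lives in the processes and each new event updates it immediately.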
Sometimes your machine learning data application needs to process in memory and iterate…
[Slide visual: Processes iterating over Data held in memory]
…like in Machine Learning in Spark
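Why holding data in memory matters for iteration can be shown with a toy gradient descent that rescans the same cached dataset every pass. This is plain Python, not the Spark API; the dataset and learning rate are invented for the sketch:

```python
# Toy sketch of iterative machine learning: gradient descent re-reads the
# same dataset on every iteration, so keeping it in memory (rather than
# re-reading from disk per pass, as chained MapReduce jobs would) pays off.
# Illustrative only, not the Spark API.

# Dataset "cached" in memory once; the true relationship is y = 2 * x.
data = [(x, 2.0 * x) for x in range(1, 6)]

w = 0.0      # model parameter being learned
lr = 0.01    # learning rate
for _ in range(200):  # each iteration scans the cached data again
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 3))  # converges toward 2.0
```

Two hundred passes over an in-memory list are instant; two hundred disk scans launched as separate batch jobs are not, which is the gap Spark targets.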
Introducing YARN
YARN = Yet Another Resource Negotiator
YARN abstracts resource management so you can run more than just MapReduce.
[Slide visual: HDFS2 at the base, YARN above it, and on top MapReduce V2, Storm, MPI, Giraph, HBase, Tez, Spark … and more]
Resource Manager + Node Managers = YARN
[Slide visual: a Resource Manager with its Scheduler, plus AppMasters and Node Managers hosting Containers for MapReduce, Storm, and Pig]
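The negotiation between applications and the Resource Manager can be caricatured in a few lines. This is a toy model, not the YARN protocol, which negotiates memory and vcores and handles locality, queues, and failures; the `ResourceManager` class here is invented for the sketch:

```python
# Toy model of YARN-style resource negotiation: applications ask the
# ResourceManager for containers, and a simple scheduler grants them from
# whichever NodeManager has the most free capacity.
# Illustrative only, not the real YARN API.
class ResourceManager:
    def __init__(self, node_capacity):
        # node name -> number of free container slots
        self.free = dict(node_capacity)

    def request_container(self, app):
        # Grant a container on the node with the most free slots.
        node = max(self.free, key=self.free.get)
        if self.free[node] == 0:
            return None  # cluster full: the app must wait
        self.free[node] -= 1
        return (app, node)

rm = ResourceManager({"node1": 2, "node2": 1})
granted = [rm.request_container(app)
           for app in ["MapReduce", "Storm", "Pig", "Spark"]]
print(granted)
# [('MapReduce', 'node1'), ('Storm', 'node1'), ('Pig', 'node2'), None]
```

The key idea survives the caricature: the scheduler neither knows nor cares what the containers run, which is exactly why MapReduce, Storm, and Pig can share one cluster.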
YARN turns Hadoop into a smart phone: An App Ecosystem
hortonworks.com/yarn/
Check out the book too…
Preview at: hortonworks.com/yarn/
YARN is an essential part of a balanced breakfast in Hadoop 2.x
Introducing Tez
Tez is a YARN application, like MapReduce is a YARN application
Tez is the Lego set for your data applica=on
Tez provides a layer for abstract tasks; these could be mappers, reducers, customized stream processes, in-memory structures, etc.
Tez can chain tasks together into one job to get Map – Reduce – Reduce jobs suitable for things like Hive SQL projections, group by, and order by.
[Slide visual: TezMaps feeding a chain of TezReduces within a single job]
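The Map – Reduce – Reduce chaining can be illustrated with a group-by followed by an order-by, written in plain Python as a stand-in for what Tez runs as one job (this is not the Tez API; the sales data is invented):

```python
# Sketch of the Map -> Reduce -> Reduce chaining Tez enables: a group-by
# (first reduce) flows straight into an order-by (second reduce) inside
# one job, with no intermediate job boundary between them.
# Plain Python for illustration, not the Tez API.
from itertools import groupby
from operator import itemgetter

sales = [("east", 10), ("west", 5), ("east", 7), ("north", 20), ("west", 1)]

# Map: emit (region, amount) -- here the records already are key/value pairs.
mapped = [(region, amount) for region, amount in sales]

# Reduce 1 (GROUP BY region): sum amounts per region.
grouped = [(region, sum(a for _, a in rows))
           for region, rows in groupby(sorted(mapped, key=itemgetter(0)),
                                       key=itemgetter(0))]

# Reduce 2 (ORDER BY total DESC): globally order the grouped totals.
ordered = sorted(grouped, key=itemgetter(1), reverse=True)

print(ordered)  # [('north', 20), ('east', 17), ('west', 6)]
```

With classic MapReduce this query would be two jobs with an HDFS write in between; the chained form is one pipeline end to end.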
Tez can provide long-running containers for applications like Hive to side-step the batch-process startups you would have with MapReduce.
Hadoop has other open source projects…
Hive = {SQL -> Tez || MapReduce} SQL-IN-HADOOP
Pig = {PigLatin -> Tez || MapReduce}
HCatalog = {metadata* for MapReduce, Hive, Pig, HBase}
*metadata = tables, columns, partitions, types
Oozie = Job::{Task, Task, if Task, then Task, final Task}
Falcon
[Slide visual: Feeds replicated between Hadoop clusters for DR]
Knox
[Slide visual: REST Clients reaching Hadoop Clusters through the Knox Gateway, authenticated against Enterprise LDAP]
Flume
[Slide visual: Flume agents funneling JMS messages, weblogs, events, and files into Hadoop]
Sqoop
[Slide visual: Sqoop moving data between databases and Hadoop]
Ambari = {install, manage, monitor}
HBase = {real-time, distributed-map, big-tables}
Storm = {Complex Event Processing, Near-Real-Time, Provisioned by YARN}
Apache Hadoop
[Slide visual: the component stack: Flume, Ambari, HBase, Falcon, MapReduce, HDFS, Sqoop, HCatalog, Pig, Hive, Storm, YARN, Knox, Tez]
Hortonworks Data Platform
[Slide visual: the same components packaged in the Hortonworks Data Platform]
What else are we working on?
hortonworks.com/labs/
Hadoop is the new Modern Data Architecture