Hadoop: Past, Present and Future

© Hortonworks Inc. 2013 | Hadoop: Past, Present and Future | Chris Harris | Email: [email protected] | Twitter: cj_harris5

Upload: codemotion

Posted on 26-Jan-2015

DESCRIPTION

Ever wonder what Hadoop might look like in 12 months or 24 months or longer? Apache Hadoop MapReduce has undergone a complete overhaul to emerge as Apache Hadoop YARN, a generic compute fabric supporting MapReduce and other application paradigms. As a result, Hadoop looks very different than it did 12 months ago. This talk will take you through some ideas for YARN itself and the many ways it is really moving the needle for MapReduce, Pig, Hive, Cascading and other data-processing tools in the Hadoop ecosystem.

TRANSCRIPT

Page 1: Hadoop past, present and future


Hadoop : Past, Present and Future Chris Harris Email : [email protected] Twitter : cj_harris5

Page 2: Hadoop past, present and future

Past

Page 3: Hadoop past, present and future


A little history… it’s 2005

Page 4: Hadoop past, present and future

A Brief History of Apache Hadoop

[Timeline, 2004 to 2013:]
• 2005: Yahoo! creates a team under E14 to work on Hadoop
• 2006: Apache project established
• 2008: Yahoo! begins to operate at scale
• 2012: Hortonworks Data Platform
• 2013: Enterprise Hadoop

Page 5: Hadoop past, present and future

Key Hadoop Data Types

1. Sentiment: understand how your customers feel about your brand and products, right now
2. Clickstream: capture and analyze website visitors' data trails and optimize your website
3. Sensor/Machine: discover patterns in data streaming automatically from remote sensors and machines
4. Geographic: analyze location-based data to manage operations where they occur
5. Server Logs: research logs to diagnose process failures and prevent security breaches
6. Unstructured (txt, video, pictures, etc.): understand patterns in files across millions of web pages, emails, and documents

Page 6: Hadoop past, present and future

Hadoop is NOT

• an ESB
• NoSQL
• HPC
• Relational
• Real-time
• the "Jack of all Trades"

Page 7: Hadoop past, present and future

Hadoop 1

• Limited to 4,000 nodes per cluster
• Scales as O(# of tasks in a cluster)
• JobTracker bottleneck: resource management, job scheduling and monitoring
• Only one namespace for managing HDFS
• Map and Reduce slots are static
• The only job type that runs is MapReduce

Page 8: Hadoop past, present and future

Hadoop 1 - Basics

[Diagram: MapReduce (computation framework) layered on top of HDFS (storage framework); blocks A, B and C are each replicated across the cluster's DataNodes.]

Page 9: Hadoop past, present and future

Hadoop 1 - Reading Files

[Diagram: a Hadoop Client asks the NameNode (backed by fsimage/edits) to read a file; the NameNode returns DataNodes, block ids, etc., and the client reads the blocks directly from DataNode/TaskTracker (DN | TT) nodes spread across Rack1 to RackN. DataNodes send heartbeats and block reports to the NameNode; the Secondary NameNode (SNameNode) performs checkpoints.]
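The read path above can be sketched in a few lines of plain Python. This is a conceptual model, not the real HDFS client; the class and method names are invented for illustration:

```python
# Conceptual model of the Hadoop 1 read path: the NameNode holds only
# metadata (which DataNodes hold which blocks); the client fetches the
# block locations once, then reads the actual bytes from the DataNodes.

class NameNode:
    def __init__(self):
        # file path -> ordered list of (block_id, [replica DataNode names])
        self.metadata = {}

    def get_block_locations(self, path):
        return self.metadata[path]

class DataNode:
    def __init__(self):
        self.blocks = {}  # block_id -> bytes

def read_file(namenode, datanodes, path):
    data = b""
    for block_id, replicas in namenode.get_block_locations(path):
        dn = datanodes[replicas[0]]   # read from the first (closest) replica
        data += dn.blocks[block_id]
    return data
```

The point of the split is visible even in the toy: the NameNode never touches block data, so the heavy byte traffic flows client-to-DataNode.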

Page 10: Hadoop past, present and future

Hadoop 1 - Writing Files

[Diagram: the client requests a write (recorded in the NameNode's fsimage/edits); the NameNode returns target DataNodes; the client writes blocks to the first DataNode, which forwards replicas down the chain (replication pipelining). DataNodes send block reports; the SNameNode checkpoints.]
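Replication pipelining, where the client sends each block to one DataNode and that node forwards it down a chain of replicas, can be modelled like this (a toy sketch with invented names):

```python
# Toy model of HDFS replication pipelining: the client writes a block to
# the head of the pipeline; each node stores the block locally and
# forwards it downstream, so the client sends the data only once.

def pipeline_write(block_id, data, pipeline, stores):
    """pipeline: ordered DataNode names; stores: name -> {block_id: bytes}.
    Returns the number of replicas written."""
    if not pipeline:
        return 0
    head, rest = pipeline[0], pipeline[1:]
    stores[head][block_id] = data                            # store locally
    return 1 + pipeline_write(block_id, data, rest, stores)  # forward on
```

In real HDFS the forwarding happens packet by packet and in parallel; the sketch serializes it for clarity.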

Page 11: Hadoop past, present and future

Hadoop 1 - Running Jobs

[Diagram: the client submits a job to the JobTracker, which deploys map and reduce tasks to TaskTrackers (DN | TT) across Rack1 to RackN; map output is shuffled to the reducers, which write their output files (part 0).]
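The map, shuffle, reduce flow that the JobTracker coordinates can be imitated with ordinary Python functions. This is a single-process sketch of the programming model, not Hadoop itself:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply map_fn to every input record, collecting (key, value) pairs."""
    return [kv for rec in records for kv in map_fn(rec)]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups, reduce_fn):
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

# The classic word-count job expressed against this model:
def wc_map(line):
    return [(w, 1) for w in line.split()]

def wc_reduce(word, counts):
    return sum(counts)
```

In Hadoop 1, map and reduce run on TaskTrackers in fixed slots and the shuffle moves data across the network; the dataflow, however, is exactly this pipeline.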

Page 12: Hadoop past, present and future

Hadoop 1 - Security

[Diagram: users outside the firewall authenticate and are authorized (authN/authZ) via LDAP/AD and a KDC, reaching the Hadoop cluster through a client node/spoke server. A block token is used for accessing data; a delegation token is used for running jobs. Encryption is available via a plugin.]

Page 13: Hadoop past, present and future

Hadoop 1 - APIs

• org.apache.hadoop.mapreduce.Partitioner
• org.apache.hadoop.mapreduce.Mapper
• org.apache.hadoop.mapreduce.Reducer
• org.apache.hadoop.mapreduce.Job
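As an aside on `Partitioner`: the default behaviour (Hadoop's HashPartitioner) is just key-hash modulo the number of reduce tasks, so all values for one key land on the same reducer. A plain-Python illustration of the idea (not the Java API; the hash function here is an invented, deterministic stand-in):

```python
# Sketch of Hadoop's default hash partitioning.  Java uses
# key.hashCode(); we use a stable toy hash so results are reproducible.

def stable_hash(key):
    return sum(key.encode())  # deterministic stand-in for hashCode()

def partition(key, num_reduce_tasks):
    """Return the reducer index (0 .. num_reduce_tasks - 1) for this key."""
    return stable_hash(key) % num_reduce_tasks
```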

Page 14: Hadoop past, present and future

Present

Page 15: Hadoop past, present and future

Hadoop 2

• Potentially up to 10,000 nodes per cluster
• Scales as O(cluster size)
• Supports multiple namespaces for managing HDFS
• Efficient cluster utilization (YARN)
• MRv1 backward- and forward-compatible
• Any application can integrate with Hadoop
• Beyond Java

Page 16: Hadoop past, present and future


Hadoop 2 - Basics

Page 17: Hadoop past, present and future

Hadoop 2 - Reading Files (w/ NN Federation)

[Diagram: the client reads a file through one of several federated NameNodes (NN1/ns1 to NN4/ns4), each managing its own namespace and fsimage/edits copy. The NameNode returns DataNodes, block ids, etc., and the client reads blocks from DataNode/NodeManager (DN | NM) nodes across Rack1 to RackN. DataNodes register and send heartbeats/block reports; each NameNode has its own SNameNode for checkpoints, or a Backup NN kept in sync (fs sync). Block pools map namespaces onto DataNodes, e.g. ns1 → dn1, dn2; ns2 → dn1, dn3; ns3 → dn4, dn5; ns4 → dn4, dn5.]

Page 18: Hadoop past, present and future

Hadoop 2 - Writing Files

[Diagram: as in Hadoop 1, the client requests a write, receives target DataNodes and writes blocks with replication pipelining, but the request goes to one of the federated NameNodes (NN1/ns1 to NN4/ns4), each with its own fsimage/edits copy and its own SNameNode checkpointing (or a Backup NN kept in sync via fs sync). DataNodes send block reports to every NameNode.]

Page 19: Hadoop past, present and future

Hadoop 2 - Running Jobs

[Diagram: two Hadoop Clients create and submit applications (app1, app2) to the ResourceManager, which comprises the ApplicationsManager (ASM) and a Scheduler with queues. The ASM manages containers, the Scheduler partitions cluster resources, and NodeManagers across Rack1 to RackN report to the ResourceManager. A per-application ApplicationMaster (AM1, AM2) negotiates containers (C1.1 to C1.4, C2.1 to C2.3) and sends status reports.]
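The key shift from Hadoop 1 is visible in this flow: the ResourceManager no longer tracks individual tasks; it only hands out containers, and the per-application ApplicationMaster does the job-level bookkeeping. A heavily simplified sketch (invented names, not the real YARN API):

```python
# Toy YARN flow: the RM grants containers from a shared pool; the AM
# negotiates containers, runs its tasks in them, and releases them.

class ResourceManager:
    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, wanted):
        granted = min(wanted, self.free)  # the scheduler may grant fewer
        self.free -= granted
        return granted

class ApplicationMaster:
    def __init__(self, rm, tasks):
        self.rm, self.tasks = rm, tasks

    def run(self):
        done = 0
        while done < self.tasks:
            granted = self.rm.allocate(self.tasks - done)
            if granted == 0:
                break                    # in reality: wait for free capacity
            done += granted              # pretend each container runs 1 task
            self.rm.free += granted      # release containers on completion
        return done
```

Because the RM's state is per-application rather than per-task, its load is O(cluster size) rather than O(# of tasks), which is exactly the scalability claim above.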

Page 20: Hadoop past, present and future

Hadoop 2 - Security

[Diagram: REST clients, JDBC clients and browsers (HUE) pass through a firewall into a DMZ hosting the Knox Gateway cluster, which integrates with LDAP/AD, a KDC and an enterprise/cloud SSO provider, then through a second firewall into the Hadoop cluster. Native Hive/HBase encryption applies inside the cluster.]

Page 21: Hadoop past, present and future

Hadoop 2 - APIs

• org.apache.hadoop.yarn.api.ApplicationClientProtocol
• org.apache.hadoop.yarn.api.ApplicationMasterProtocol
• org.apache.hadoop.yarn.api.ContainerManagementProtocol

Page 22: Hadoop past, present and future

Future

Page 23: Hadoop past, present and future

Apache Tez
A New Hadoop Data Processing Framework

Page 24: Hadoop past, present and future

HDP: Enterprise Hadoop Distribution

[Diagram: the Hortonworks Data Platform (HDP) stacks PLATFORM SERVICES (Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots) on top of the HADOOP CORE (HDFS, MapReduce, YARN*, Tez*, other*), DATA SERVICES (Hive & HCatalog, Pig, HBase) and OPERATIONAL SERVICES (Oozie, Ambari, Falcon*), with load & extract through Sqoop, Flume, NFS, WebHDFS and Knox*.]

Hortonworks Data Platform (HDP): Enterprise Hadoop
• The ONLY 100% open source and complete distribution
• Enterprise grade, proven and tested at scale
• Ecosystem endorsed to ensure interoperability

Page 25: Hadoop past, present and future

Tez ("Speed")

• What is it?
  – A data processing framework as an alternative to MapReduce
  – A new incubation project in the ASF
• Who else is involved?
  – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
• Why does it matter?
  – Widens the platform for Hadoop use cases
  – Crucial to improving the performance of low-latency applications
  – Core to the Stinger initiative
  – Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop

Page 26: Hadoop past, present and future

Moving Hadoop Beyond MapReduce

• Low-level data-processing execution engine
• Built on YARN
• Enables pipelining of jobs
• Removes task and job launch times
• Does not write intermediate output to HDFS
  – Much lighter disk and network usage
• New base for MapReduce, Hive, Pig, Cascading, etc.
• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline

Page 27: Hadoop past, present and future

Tez - Core Idea

• A Task with pluggable Input, Processor & Output
• A YARN ApplicationMaster runs a DAG of Tez Tasks

Tez Task = <Input, Processor, Output>
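The <Input, Processor, Output> composition can be sketched as pluggable Python callables. This is a conceptual illustration of the idea, not the Tez Java API; the example parts are hypothetical stand-ins:

```python
# A Tez task is a triple: where the data comes from (Input), what is
# done to it (Processor), and where it goes (Output).  Swapping any one
# part yields a different building block without touching the others.

def tez_task(input_fn, processor_fn, output_fn):
    def run():
        return output_fn(processor_fn(input_fn()))
    return run

# Hypothetical stand-ins for an HDFS input, a map processor and a
# sorted output:
hdfs_input = lambda: ["b", "a", "c"]
map_processor = lambda records: [r.upper() for r in records]
sorted_output = lambda records: sorted(records)

task = tez_task(hdfs_input, map_processor, sorted_output)
```

Replacing `sorted_output` with an unsorted, in-memory output would give the "in-memory map" building block of the next slide, with no change to the input or processor.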

Page 28: Hadoop past, present and future

Building Blocks for Tasks

[Diagram: tasks assembled from pluggable parts.
• MapReduce 'Map' task: HDFS Input → Map Processor → Sorted Output
• MapReduce 'Reduce' task: Shuffle Input → Reduce Processor → HDFS Output
• Intermediate 'Reduce' for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output
• Special Pig/Hive 'Map' (Tez Task): HDFS Input → Map Processor → Pipeline Sorter Output
• Special Pig/Hive 'Reduce' (Tez Task): Shuffle Skip-merge Input → Reduce Processor → Sorted Output
• In-memory Map (Tez Task): HDFS Input → Map Processor → In-memory Sorted Output]

Page 29: Hadoop past, present and future

Pig/Hive-MR versus Pig/Hive-Tez

SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

[Diagram: Pig/Hive on MR runs this query as three chained jobs (Job 1 → Job 2 → Job 3) with an I/O synchronization barrier between each pair; Pig/Hive on Tez runs it as a single job.]
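The difference can be made concrete by counting I/O synchronization barriers: on MR, each stage of the plan is its own job, so every stage boundary materializes to HDFS before the next job starts; on Tez the whole plan is one DAG and intermediates stay off HDFS. A toy model of that claim (an assumed simplification, not a benchmark):

```python
# Count the HDFS synchronization barriers for an n-stage query plan.
# On MR each stage is a separate job, so each boundary between
# consecutive jobs writes to and re-reads from HDFS; on Tez one DAG
# runs all stages and streams data between them.

def hdfs_barriers(num_stages, engine):
    if engine == "mr":
        return num_stages - 1   # one barrier between consecutive jobs
    if engine == "tez":
        return 0                # single job, no HDFS materialization
    raise ValueError(engine)
```

For the three-job plan in the diagram this gives two barriers on MR and none on Tez, which is where the latency saving comes from.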

Page 30: Hadoop past, present and future

Tez on YARN: Going Beyond Batch

• Tez Optimizes Execution: a new runtime engine for more efficient data processing
• Always-On Tez Service: low-latency processing for all Hadoop data processing

Page 31: Hadoop past, present and future

Apache Knox
Secure Access to Hadoop

Page 32: Hadoop past, present and future

Knox Initiative: Make Hadoop Security Simple

• Simplify Security: simplify security for both users and operators. Provide seamless access for users while securing the cluster at the perimeter, shielding the intricacies of the security implementation.
• Aggregate Access: deliver unified and centralized access to the Hadoop cluster. Make Hadoop feel like a single application to users.
• Client Agility: ensure service users are abstracted from where services are located and how services are configured & scaled.

Page 33: Hadoop past, present and future

Knox: Make Hadoop Security Simple

[Diagram: a client makes {REST} calls to the Knox Gateway, which performs authentication & verification against a user store (KDC, AD, LDAP) before forwarding requests to the Hadoop cluster.]

Page 34: Hadoop past, present and future

Knox: Next Generation of Hadoop Security

• All users see one end-point website
• All online systems see one end-point RESTful service
• Consistency across all interfaces and capabilities
• Firewalled cluster that no end users need to access
• More IT-friendly. Enables:
  – Systems admins
  – DB admins
  – Security admins
  – Network admins

[Diagram: end users, online apps and analytics tools reach the Hadoop cluster only through the Gateway, which sits between two firewalls.]

Page 35: Hadoop past, present and future

Apache Falcon
Data Lifecycle Management for Hadoop

Page 36: Hadoop past, present and future

Data Lifecycle on Hadoop is Challenging

Data management needs: data processing, replication, retention, scheduling, reprocessing, multi-cluster management.
Tools: Oozie, Sqoop, Distcp, Flume, Map/Reduce, Hive and Pig jobs.

Problem: a patchwork of tools complicates data lifecycle management.
Result: long development cycles and quality challenges.

Page 37: Hadoop past, present and future

Falcon: One-stop Shop for Data Lifecycle

[Diagram: Apache Falcon provides for the data management needs (data processing, replication, retention, scheduling, reprocessing, multi-cluster management) and orchestrates the underlying tools (Oozie, Sqoop, Distcp, Flume, Map/Reduce, Hive and Pig jobs).]

Falcon provides a single interface to orchestrate the data lifecycle. Sophisticated DLM is easily added to Hadoop applications.

Page 38: Hadoop past, present and future

Falcon At A Glance

>  Falcon provides the key services data processing applications need.
>  Complex data processing logic is handled by Falcon instead of hard-coded in apps.
>  Faster development and higher quality for ETL, reporting and other data processing apps on Hadoop.

[Diagram: data processing applications drive the Falcon Data Lifecycle Management Service through spec files or REST APIs; Falcon supplies data import and replication, scheduling and coordination, data lifecycle policies, multi-cluster management, and SLA management.]

Page 39: Hadoop past, present and future

Falcon Core Capabilities

• Core Functionality
  – Pipeline processing
  – Replication
  – Retention
  – Late data handling
• Automates
  – Scheduling and retry
  – Recording audit, lineage and metrics
• Operations and Management
  – Monitoring, management, metering
  – Alerts and notifications
  – Multi-cluster federation
• CLI and REST API

Page 40: Hadoop past, present and future

Falcon Example: Multi-Cluster Failover

>  Falcon manages workflow, replication or both.
>  Enables business continuity without requiring full data reprocessing.
>  Failover clusters require less storage and CPU.

[Diagram: the primary Hadoop cluster holds staged, cleansed, conformed and presented data; replication copies the staged and presented data to a failover cluster, which feeds BI and analytics.]

Page 41: Hadoop past, present and future

Falcon Example: Retention Policies

>  Sophisticated retention policies expressed in one place.
>  Simplify data retention for audit, compliance, or data re-processing.

[Diagram: staged data is retained 5 years; cleansed data 3 years; conformed data 3 years; presented data retains the last copy only.]
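The kind of policy evaluation described above can be sketched in a few lines. The representation is invented for illustration (Falcon actually declares retention in feed spec files); only the policy values come from the slide:

```python
from datetime import date, timedelta

# Toy retention evaluator using the slide's policies: given the dates on
# which instances of a dataset landed, decide which instances to evict.

POLICIES = {
    "staged": timedelta(days=5 * 365),
    "cleansed": timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
    "presented": "last-copy-only",
}

def to_evict(dataset, instance_dates, today):
    policy = POLICIES[dataset]
    if policy == "last-copy-only":
        return sorted(instance_dates)[:-1]   # keep only the newest copy
    return [d for d in instance_dates if today - d > policy]
```

The value of expressing this in one place, rather than per-application, is that changing the staged-data window is a one-line policy edit rather than a code change across every pipeline.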

Page 42: Hadoop past, present and future

Falcon Example: Late Data Handling

>  Processing waits until all data is available.
>  Developers don't write complex data-handling rules within applications.

[Diagram: online transaction data (pulled via Sqoop) and web log data (pushed via FTP) land in a staging area and are combined into one dataset; processing waits up to 4 hours for the FTP data to arrive.]
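The "wait up to 4 hours" rule amounts to a simple gate: run once every required input has arrived, or once the late-data window has expired. A sketch with hypothetical names (Falcon expresses this declaratively in feed/process specs rather than in code):

```python
from datetime import datetime, timedelta

# Toy late-data gate for the combining step in the diagram above.

def should_run(arrived, required, nominal_time, now,
               wait=timedelta(hours=4)):
    """arrived/required: sets of input feed names."""
    if required.issubset(arrived):
        return True                       # everything is here: go
    return now - nominal_time >= wait     # window expired: go regardless
```

Pushing this check into the scheduler is the point of the slide: application code never has to encode the waiting rules.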

Page 43: Hadoop past, present and future

Multi-Cluster Management with Prism

>  Prism is the part of Falcon that handles multi-cluster management.
>  Key use cases: replication and data processing that span clusters.

Page 44: Hadoop past, present and future

Hortonworks Sandbox
Go from Zero to Big Data in 15 Minutes

Page 45: Hadoop past, present and future

Sandbox: A Guided Tour of HDP

• Tutorials and videos give a guided tour of HDP and Hadoop
• Perfect for beginners or anyone learning more about Hadoop
• Installs easily on your laptop or desktop
• Browse and manage HDFS files
• Easily import data and create tables
• Easy-to-use editors for Apache Pig and Hive
• Latest tutorials pushed directly to your Sandbox

Page 46: Hadoop past, present and future


THANK YOU! Chris Harris

[email protected]

Download Sandbox

hortonworks.com/sandbox