Hadoop: Past, Present and Future

© Hortonworks Inc. 2013 | Hadoop: Past, Present and Future | Chris Harris | Email: [email protected] | Twitter: cj_harris5

Upload: codemotion

Posted on 26-Jan-2015

DESCRIPTION

Ever wonder what Hadoop might look like in 12 months or 24 months or longer? Apache Hadoop MapReduce has undergone a complete overhaul to emerge as Apache Hadoop YARN, a generic compute fabric supporting MapReduce and other application paradigms. As a result, Hadoop looks very different than it did 12 months ago. This talk will take you through some ideas for YARN itself and the many ways it is really moving the needle for MapReduce, Pig, Hive, Cascading and other data-processing tools in the Hadoop ecosystem.

TRANSCRIPT

Page 1: Hadoop past, present and future


Hadoop : Past, Present and Future Chris Harris Email : [email protected] Twitter : cj_harris5

Page 2: Hadoop past, present and future

Past

Page 3: Hadoop past, present and future


A little history… it’s 2005

Page 4: Hadoop past, present and future

A Brief History of Apache Hadoop

[Timeline, 2004 to 2013:]
• 2005: Yahoo! creates a team under E14 to work on Hadoop
• 2006: Apache project established
• 2008: Yahoo! begins to operate at scale
• 2012: Hortonworks Data Platform
• 2013: Enterprise Hadoop

Page 5: Hadoop past, present and future

Key Hadoop Data Types

1. Sentiment: understand how your customers feel about your brand and products, right now
2. Clickstream: capture and analyze website visitors' data trails and optimize your website
3. Sensor/Machine: discover patterns in data streaming automatically from remote sensors and machines
4. Geographic: analyze location-based data to manage operations where they occur
5. Server Logs: research logs to diagnose process failures and prevent security breaches
6. Unstructured (txt, video, pictures, etc.): understand patterns in files across millions of web pages, emails, and documents

Page 6: Hadoop past, present and future

Hadoop is NOT

• an ESB
• NoSQL
• HPC
• Relational
• Real-time
• the "Jack of all Trades"

Page 7: Hadoop past, present and future

Hadoop 1

• Limited to 4,000 nodes per cluster
• Scales as O(# of tasks in a cluster)
• JobTracker bottleneck: resource management, job scheduling and monitoring
• Only one namespace for managing HDFS
• Map and Reduce slots are static
• The only job type that runs is MapReduce

Page 8: Hadoop past, present and future

Hadoop 1 - Basics

[Diagram: MapReduce (computation framework) layered on top of HDFS (storage framework); blocks A, B and C are each replicated across the cluster's DataNodes.]

Page 9: Hadoop past, present and future

Hadoop 1 - Reading Files

[Diagram: a Hadoop Client asks the NameNode (backed by fsimage/edits) to read a file; the NameNode returns DataNodes, block ids, etc., and the client reads the blocks directly from DataNode/TaskTracker (DN | TT) nodes spread across Rack1 to RackN. DataNodes send heartbeats and block reports to the NameNode; the Secondary NameNode (SNameNode) performs checkpoints.]
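The read path above can be sketched in a few lines of plain Python. This is a conceptual model, not the real HDFS client; the class and method names are invented for illustration:

```python
# Conceptual model of the Hadoop 1 read path: the NameNode holds only
# metadata (which DataNodes hold which blocks); the client fetches the
# block locations once, then reads the actual bytes from the DataNodes.

class NameNode:
    def __init__(self):
        # file path -> ordered list of (block_id, [replica DataNode names])
        self.metadata = {}

    def get_block_locations(self, path):
        return self.metadata[path]

class DataNode:
    def __init__(self):
        self.blocks = {}  # block_id -> bytes

def read_file(namenode, datanodes, path):
    data = b""
    for block_id, replicas in namenode.get_block_locations(path):
        dn = datanodes[replicas[0]]   # read from the first (closest) replica
        data += dn.blocks[block_id]
    return data
```

The point of the split is visible even in the toy: the NameNode never touches block data, so the heavy byte traffic flows client-to-DataNode.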

Page 10: Hadoop past, present and future

Hadoop 1 - Writing Files

[Diagram: the client requests a write (recorded in the NameNode's fsimage/edits); the NameNode returns target DataNodes; the client writes blocks to the first DataNode, which forwards replicas down the chain (replication pipelining). DataNodes send block reports; the SNameNode checkpoints.]
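Replication pipelining, where the client sends each block to one DataNode and that node forwards it down a chain of replicas, can be modelled like this (a toy sketch with invented names):

```python
# Toy model of HDFS replication pipelining: the client writes a block to
# the head of the pipeline; each node stores the block locally and
# forwards it downstream, so the client sends the data only once.

def pipeline_write(block_id, data, pipeline, stores):
    """pipeline: ordered DataNode names; stores: name -> {block_id: bytes}.
    Returns the number of replicas written."""
    if not pipeline:
        return 0
    head, rest = pipeline[0], pipeline[1:]
    stores[head][block_id] = data                            # store locally
    return 1 + pipeline_write(block_id, data, rest, stores)  # forward on
```

In real HDFS the forwarding happens packet by packet and in parallel; the sketch serializes it for clarity.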

Page 11: Hadoop past, present and future

Hadoop 1 - Running Jobs

[Diagram: the client submits a job to the JobTracker, which deploys map and reduce tasks to TaskTrackers (DN | TT) across Rack1 to RackN; map output is shuffled to the reducers, which write their output files (part 0).]
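The map, shuffle, reduce flow that the JobTracker coordinates can be imitated with ordinary Python functions. This is a single-process sketch of the programming model, not Hadoop itself:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply map_fn to every input record, collecting (key, value) pairs."""
    return [kv for rec in records for kv in map_fn(rec)]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups, reduce_fn):
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

# The classic word-count job expressed against this model:
def wc_map(line):
    return [(w, 1) for w in line.split()]

def wc_reduce(word, counts):
    return sum(counts)
```

In Hadoop 1, map and reduce run on TaskTrackers in fixed slots and the shuffle moves data across the network; the dataflow, however, is exactly this pipeline.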

Page 12: Hadoop past, present and future

Hadoop 1 - Security

[Diagram: users outside the firewall authenticate and are authorized (authN/authZ) via LDAP/AD and a KDC, reaching the Hadoop cluster through a client node/spoke server. A block token is used for accessing data; a delegation token is used for running jobs. Encryption is available via a plugin.]

Page 13: Hadoop past, present and future

Hadoop 1 - APIs

• org.apache.hadoop.mapreduce.Partitioner
• org.apache.hadoop.mapreduce.Mapper
• org.apache.hadoop.mapreduce.Reducer
• org.apache.hadoop.mapreduce.Job
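As an aside on `Partitioner`: the default behaviour (Hadoop's HashPartitioner) is just key-hash modulo the number of reduce tasks, so all values for one key land on the same reducer. A plain-Python illustration of the idea (not the Java API; the hash function here is an invented, deterministic stand-in):

```python
# Sketch of Hadoop's default hash partitioning.  Java uses
# key.hashCode(); we use a stable toy hash so results are reproducible.

def stable_hash(key):
    return sum(key.encode())  # deterministic stand-in for hashCode()

def partition(key, num_reduce_tasks):
    """Return the reducer index (0 .. num_reduce_tasks - 1) for this key."""
    return stable_hash(key) % num_reduce_tasks
```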

Page 14: Hadoop past, present and future

Present

Page 15: Hadoop past, present and future

Hadoop 2

• Potentially up to 10,000 nodes per cluster
• Scales as O(cluster size)
• Supports multiple namespaces for managing HDFS
• Efficient cluster utilization (YARN)
• MRv1 backward- and forward-compatible
• Any application can integrate with Hadoop
• Beyond Java

Page 16: Hadoop past, present and future


Hadoop 2 - Basics

Page 17: Hadoop past, present and future

Hadoop 2 - Reading Files (w/ NN Federation)

[Diagram: the client reads a file through one of several federated NameNodes (NN1/ns1 to NN4/ns4), each managing its own namespace and fsimage/edits copy. The NameNode returns DataNodes, block ids, etc., and the client reads blocks from DataNode/NodeManager (DN | NM) nodes across Rack1 to RackN. DataNodes register and send heartbeats/block reports; each NameNode has its own SNameNode for checkpoints, or a Backup NN kept in sync (fs sync). Block pools map namespaces onto DataNodes, e.g. ns1 → dn1, dn2; ns2 → dn1, dn3; ns3 → dn4, dn5; ns4 → dn4, dn5.]

Page 18: Hadoop past, present and future

Hadoop 2 - Writing Files

[Diagram: as in Hadoop 1, the client requests a write, receives target DataNodes and writes blocks with replication pipelining, but the request goes to one of the federated NameNodes (NN1/ns1 to NN4/ns4), each with its own fsimage/edits copy and its own SNameNode checkpointing (or a Backup NN kept in sync via fs sync). DataNodes send block reports to every NameNode.]

Page 19: Hadoop past, present and future

Hadoop 2 - Running Jobs

[Diagram: two Hadoop Clients create and submit applications (app1, app2) to the ResourceManager, which comprises the ApplicationsManager (ASM) and a Scheduler with queues. The ASM manages containers, the Scheduler partitions cluster resources, and NodeManagers across Rack1 to RackN report to the ResourceManager. A per-application ApplicationMaster (AM1, AM2) negotiates containers (C1.1 to C1.4, C2.1 to C2.3) and sends status reports.]
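The key shift from Hadoop 1 is visible in this flow: the ResourceManager no longer tracks individual tasks; it only hands out containers, and the per-application ApplicationMaster does the job-level bookkeeping. A heavily simplified sketch (invented names, not the real YARN API):

```python
# Toy YARN flow: the RM grants containers from a shared pool; the AM
# negotiates containers, runs its tasks in them, and releases them.

class ResourceManager:
    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, wanted):
        granted = min(wanted, self.free)  # the scheduler may grant fewer
        self.free -= granted
        return granted

class ApplicationMaster:
    def __init__(self, rm, tasks):
        self.rm, self.tasks = rm, tasks

    def run(self):
        done = 0
        while done < self.tasks:
            granted = self.rm.allocate(self.tasks - done)
            if granted == 0:
                break                    # in reality: wait for free capacity
            done += granted              # pretend each container runs 1 task
            self.rm.free += granted      # release containers on completion
        return done
```

Because the RM's state is per-application rather than per-task, its load is O(cluster size) rather than O(# of tasks), which is exactly the scalability claim above.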

Page 20: Hadoop past, present and future

Hadoop 2 - Security

[Diagram: REST clients, JDBC clients and browsers (HUE) pass through a firewall into a DMZ hosting the Knox Gateway cluster, which integrates with LDAP/AD, a KDC and an enterprise/cloud SSO provider, then through a second firewall into the Hadoop cluster. Native Hive/HBase encryption applies inside the cluster.]

Page 21: Hadoop past, present and future

Hadoop 2 - APIs

• org.apache.hadoop.yarn.api.ApplicationClientProtocol
• org.apache.hadoop.yarn.api.ApplicationMasterProtocol
• org.apache.hadoop.yarn.api.ContainerManagementProtocol

Page 22: Hadoop past, present and future

Future

Page 23: Hadoop past, present and future

Apache Tez
A New Hadoop Data Processing Framework

Page 24: Hadoop past, present and future

HDP: Enterprise Hadoop Distribution

[Diagram: the Hortonworks Data Platform (HDP) stacks PLATFORM SERVICES (Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots) on top of the HADOOP CORE (HDFS, MapReduce, YARN*, Tez*, other*), DATA SERVICES (Hive & HCatalog, Pig, HBase) and OPERATIONAL SERVICES (Oozie, Ambari, Falcon*), with load & extract through Sqoop, Flume, NFS, WebHDFS and Knox*.]

Hortonworks Data Platform (HDP): Enterprise Hadoop
• The ONLY 100% open source and complete distribution
• Enterprise grade, proven and tested at scale
• Ecosystem endorsed to ensure interoperability

Page 25: Hadoop past, present and future

Tez ("Speed")

• What is it?
  – A data processing framework as an alternative to MapReduce
  – A new incubation project in the ASF
• Who else is involved?
  – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
• Why does it matter?
  – Widens the platform for Hadoop use cases
  – Crucial to improving the performance of low-latency applications
  – Core to the Stinger initiative
  – Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop

Page 26: Hadoop past, present and future

Moving Hadoop Beyond MapReduce

• Low-level data-processing execution engine
• Built on YARN
• Enables pipelining of jobs
• Removes task and job launch times
• Does not write intermediate output to HDFS
  – Much lighter disk and network usage
• New base for MapReduce, Hive, Pig, Cascading, etc.
• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline

Page 27: Hadoop past, present and future

Tez - Core Idea

• A Task with pluggable Input, Processor & Output
• A YARN ApplicationMaster runs a DAG of Tez Tasks

Tez Task = <Input, Processor, Output>
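The <Input, Processor, Output> composition can be sketched as pluggable Python callables. This is a conceptual illustration of the idea, not the Tez Java API; the example parts are hypothetical stand-ins:

```python
# A Tez task is a triple: where the data comes from (Input), what is
# done to it (Processor), and where it goes (Output).  Swapping any one
# part yields a different building block without touching the others.

def tez_task(input_fn, processor_fn, output_fn):
    def run():
        return output_fn(processor_fn(input_fn()))
    return run

# Hypothetical stand-ins for an HDFS input, a map processor and a
# sorted output:
hdfs_input = lambda: ["b", "a", "c"]
map_processor = lambda records: [r.upper() for r in records]
sorted_output = lambda records: sorted(records)

task = tez_task(hdfs_input, map_processor, sorted_output)
```

Replacing `sorted_output` with an unsorted, in-memory output would give the "in-memory map" building block of the next slide, with no change to the input or processor.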

Page 28: Hadoop past, present and future

Building Blocks for Tasks

[Diagram: tasks assembled from pluggable parts.
• MapReduce 'Map' task: HDFS Input → Map Processor → Sorted Output
• MapReduce 'Reduce' task: Shuffle Input → Reduce Processor → HDFS Output
• Intermediate 'Reduce' for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output
• Special Pig/Hive 'Map' (Tez Task): HDFS Input → Map Processor → Pipeline Sorter Output
• Special Pig/Hive 'Reduce' (Tez Task): Shuffle Skip-merge Input → Reduce Processor → Sorted Output
• In-memory Map (Tez Task): HDFS Input → Map Processor → In-memory Sorted Output]

Page 29: Hadoop past, present and future

Pig/Hive-MR versus Pig/Hive-Tez

SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

[Diagram: Pig/Hive on MR runs this query as three chained jobs (Job 1 → Job 2 → Job 3) with an I/O synchronization barrier between each pair; Pig/Hive on Tez runs it as a single job.]
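The difference can be made concrete by counting I/O synchronization barriers: on MR, each stage of the plan is its own job, so every stage boundary materializes to HDFS before the next job starts; on Tez the whole plan is one DAG and intermediates stay off HDFS. A toy model of that claim (an assumed simplification, not a benchmark):

```python
# Count the HDFS synchronization barriers for an n-stage query plan.
# On MR each stage is a separate job, so each boundary between
# consecutive jobs writes to and re-reads from HDFS; on Tez one DAG
# runs all stages and streams data between them.

def hdfs_barriers(num_stages, engine):
    if engine == "mr":
        return num_stages - 1   # one barrier between consecutive jobs
    if engine == "tez":
        return 0                # single job, no HDFS materialization
    raise ValueError(engine)
```

For the three-job plan in the diagram this gives two barriers on MR and none on Tez, which is where the latency saving comes from.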

Page 30: Hadoop past, present and future

Tez on YARN: Going Beyond Batch

• Tez Optimizes Execution: a new runtime engine for more efficient data processing
• Always-On Tez Service: low-latency processing for all Hadoop data processing

Page 31: Hadoop past, present and future

Apache Knox
Secure Access to Hadoop

Page 32: Hadoop past, present and future

Knox Initiative: Make Hadoop Security Simple

• Simplify Security: simplify security for both users and operators. Provide seamless access for users while securing the cluster at the perimeter, shielding the intricacies of the security implementation.
• Aggregate Access: deliver unified and centralized access to the Hadoop cluster. Make Hadoop feel like a single application to users.
• Client Agility: ensure service users are abstracted from where services are located and how services are configured & scaled.

Page 33: Hadoop past, present and future

Knox: Make Hadoop Security Simple

[Diagram: a client makes {REST} calls to the Knox Gateway, which performs authentication & verification against a user store (KDC, AD, LDAP) before forwarding requests to the Hadoop cluster.]

Page 34: Hadoop past, present and future

Knox: Next Generation of Hadoop Security

• All users see one end-point website
• All online systems see one end-point RESTful service
• Consistency across all interfaces and capabilities
• Firewalled cluster that no end users need to access
• More IT-friendly. Enables:
  – Systems admins
  – DB admins
  – Security admins
  – Network admins

[Diagram: end users, online apps and analytics tools reach the Hadoop cluster only through the Gateway, which sits between two firewalls.]

Page 35: Hadoop past, present and future

Apache Falcon
Data Lifecycle Management for Hadoop

Page 36: Hadoop past, present and future

Data Lifecycle on Hadoop is Challenging

Data management needs: data processing, replication, retention, scheduling, reprocessing, multi-cluster management.
Tools: Oozie, Sqoop, Distcp, Flume, Map/Reduce, Hive and Pig jobs.

Problem: a patchwork of tools complicates data lifecycle management.
Result: long development cycles and quality challenges.

Page 37: Hadoop past, present and future

Falcon: One-stop Shop for Data Lifecycle

[Diagram: Apache Falcon provides for the data management needs (data processing, replication, retention, scheduling, reprocessing, multi-cluster management) and orchestrates the underlying tools (Oozie, Sqoop, Distcp, Flume, Map/Reduce, Hive and Pig jobs).]

Falcon provides a single interface to orchestrate the data lifecycle. Sophisticated DLM is easily added to Hadoop applications.

Page 38: Hadoop past, present and future

Falcon At A Glance

>  Falcon provides the key services data processing applications need.
>  Complex data processing logic is handled by Falcon instead of hard-coded in apps.
>  Faster development and higher quality for ETL, reporting and other data processing apps on Hadoop.

[Diagram: data processing applications drive the Falcon Data Lifecycle Management Service through spec files or REST APIs; Falcon supplies data import and replication, scheduling and coordination, data lifecycle policies, multi-cluster management, and SLA management.]

Page 39: Hadoop past, present and future

Falcon Core Capabilities

• Core Functionality
  – Pipeline processing
  – Replication
  – Retention
  – Late data handling
• Automates
  – Scheduling and retry
  – Recording audit, lineage and metrics
• Operations and Management
  – Monitoring, management, metering
  – Alerts and notifications
  – Multi-cluster federation
• CLI and REST API

Page 40: Hadoop past, present and future

Falcon Example: Multi-Cluster Failover

>  Falcon manages workflow, replication or both.
>  Enables business continuity without requiring full data reprocessing.
>  Failover clusters require less storage and CPU.

[Diagram: the primary Hadoop cluster holds staged, cleansed, conformed and presented data; replication copies the staged and presented data to a failover cluster, which feeds BI and analytics.]

Page 41: Hadoop past, present and future

Falcon Example: Retention Policies

>  Sophisticated retention policies expressed in one place.
>  Simplify data retention for audit, compliance, or data re-processing.

[Diagram: staged data is retained 5 years; cleansed data 3 years; conformed data 3 years; presented data retains the last copy only.]
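The kind of policy evaluation described above can be sketched in a few lines. The representation is invented for illustration (Falcon actually declares retention in feed spec files); only the policy values come from the slide:

```python
from datetime import date, timedelta

# Toy retention evaluator using the slide's policies: given the dates on
# which instances of a dataset landed, decide which instances to evict.

POLICIES = {
    "staged": timedelta(days=5 * 365),
    "cleansed": timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
    "presented": "last-copy-only",
}

def to_evict(dataset, instance_dates, today):
    policy = POLICIES[dataset]
    if policy == "last-copy-only":
        return sorted(instance_dates)[:-1]   # keep only the newest copy
    return [d for d in instance_dates if today - d > policy]
```

The value of expressing this in one place, rather than per-application, is that changing the staged-data window is a one-line policy edit rather than a code change across every pipeline.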

Page 42: Hadoop past, present and future

Falcon Example: Late Data Handling

>  Processing waits until all data is available.
>  Developers don't write complex data-handling rules within applications.

[Diagram: online transaction data (pulled via Sqoop) and web log data (pushed via FTP) land in a staging area and are combined into one dataset; processing waits up to 4 hours for the FTP data to arrive.]
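The "wait up to 4 hours" rule amounts to a simple gate: run once every required input has arrived, or once the late-data window has expired. A sketch with hypothetical names (Falcon expresses this declaratively in feed/process specs rather than in code):

```python
from datetime import datetime, timedelta

# Toy late-data gate for the combining step in the diagram above.

def should_run(arrived, required, nominal_time, now,
               wait=timedelta(hours=4)):
    """arrived/required: sets of input feed names."""
    if required.issubset(arrived):
        return True                       # everything is here: go
    return now - nominal_time >= wait     # window expired: go regardless
```

Pushing this check into the scheduler is the point of the slide: application code never has to encode the waiting rules.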

Page 43: Hadoop past, present and future

Multi-Cluster Management with Prism

>  Prism is the part of Falcon that handles multi-cluster management.
>  Key use cases: replication and data processing that span clusters.

Page 44: Hadoop past, present and future

Hortonworks Sandbox
Go from Zero to Big Data in 15 Minutes

Page 45: Hadoop past, present and future

Sandbox: A Guided Tour of HDP

• Tutorials and videos give a guided tour of HDP and Hadoop
• Perfect for beginners or anyone learning more about Hadoop
• Installs easily on your laptop or desktop
• Browse and manage HDFS files
• Easily import data and create tables
• Easy-to-use editors for Apache Pig and Hive
• Latest tutorials pushed directly to your Sandbox

Page 46: Hadoop past, present and future


THANK YOU! Chris Harris

[email protected]

Download Sandbox

hortonworks.com/sandbox