2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0

© Hortonworks Inc. 2013. Confidential and Proprietary.
Hadoop 2.2.0: Hadoop grows up
Adam Muise


DESCRIPTION

Our Hadoop 2.2.0 Overview for the Toronto Hadoop User Group. Go THUG life.

TRANSCRIPT

Page 1:

Hadoop 2.2.0: Hadoop grows up

Adam Muise

Page 2:

Rob Ford says…

…turn off your #*@!#%!!! mobile phones!

Page 3:

YARN: Yet Another Resource Negotiator

Page 4:

A new abstraction layer

HADOOP 1.0 (single-use system: batch apps):
– HDFS (redundant, reliable storage)
– MapReduce (cluster resource management & data processing)

HADOOP 2.0 (multi-purpose platform: batch, interactive, online, streaming, …):
– HDFS2 (redundant, reliable storage)
– YARN (cluster resource management)
– MapReduce (data processing)
– Others (data processing)

Page 5:

Concepts

• Application
  – An application is a job submitted to the framework
  – Example: a MapReduce job
• Container
  – Basic unit of allocation
  – Fine-grained resource allocation across multiple resource types (memory, CPU, disk, network, GPU, etc.)
  – e.g. container_0 = 2 GB, 1 CPU; container_1 = 1 GB, 6 CPUs
  – Replaces the fixed map/reduce slots
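The container model above can be sketched in a few lines. This is an illustrative model, not YARN code; the resource names and node capacities are assumptions:

```python
# Illustrative model of YARN containers (not Hadoop code): resources are
# requested per container, instead of occupying fixed map/reduce slots.

def fits(node_free, container):
    """A container fits on a node if every requested resource is available."""
    return all(node_free.get(r, 0) >= need for r, need in container.items())

def allocate(node_free, container):
    """Return the node's remaining capacity after placing the container."""
    if not fits(node_free, container):
        raise ValueError("insufficient resources")
    return {r: node_free.get(r, 0) - container.get(r, 0) for r in node_free}

node = {"memory_gb": 8, "vcores": 8}
container_0 = {"memory_gb": 2, "vcores": 1}   # container_0 = 2GB, 1 CPU
container_1 = {"memory_gb": 1, "vcores": 6}   # container_1 = 1GB, 6 CPU

node = allocate(node, container_0)
node = allocate(node, container_1)
print(node)  # {'memory_gb': 5, 'vcores': 1}
```

Because requests are arbitrary resource vectors rather than slots, a memory-heavy and a CPU-heavy container can share one node without wasting either resource.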

Page 6:

YARN Architecture

• ResourceManager
  – Global resource scheduler
  – Hierarchical queues
• NodeManager
  – Per-machine agent
  – Manages the life-cycle of containers
  – Container resource monitoring
• ApplicationMaster
  – Per-application
  – Manages application scheduling and task execution
  – e.g. the MapReduce ApplicationMaster

Page 7:

YARN Architecture - Walkthrough

(Diagram: Client2 submits an application to the ResourceManager/Scheduler. Each application gets its own ApplicationMaster: AM 1 runs Containers 1.1, 1.2 and 1.3, and AM 2 runs Containers 2.1 through 2.4, spread across a grid of NodeManagers.)

Page 8:

YARN as OS for the Data Lake

(Diagram: the same NodeManager grid hosting mixed workloads side by side under one ResourceManager/Scheduler: Batch MapReduce tasks (map 1.1, map 1.2, reduce 1.1), Interactive SQL vertices (vertex 1.1.1, 1.1.2, 1.2.1, 1.2.2), and Real-Time containers (nimbus0, nimbus1, nimbus2).)

Page 9:

Multi-Tenant YARN

ResourceManager / Scheduler queue hierarchy (capacities at each level sum to 100%):

root
  Adhoc 10%
    Dev 10%
    Reserved 20%
    Prod 70%
  DW 60%
    Prod 80%
    Dev 20%
      P0 70%
      P1 30%
  Mrkting 30%
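The percentages compose multiplicatively down the hierarchy: a leaf queue's absolute share of the cluster is the product of the capacity fractions along its path from root. A small sketch, illustrative only, using queue names as on the slide:

```python
# Illustrative: absolute capacity of a queue = product of capacity
# fractions along its path from root (how hierarchical queue capacities
# compose in the Capacity Scheduler). Queue names follow the slide.
tree = {
    "root.Adhoc": 0.10, "root.DW": 0.60, "root.Mrkting": 0.30,
    "root.Adhoc.Dev": 0.10, "root.Adhoc.Reserved": 0.20, "root.Adhoc.Prod": 0.70,
    "root.DW.Prod": 0.80, "root.DW.Dev": 0.20,
    "root.DW.Dev.P0": 0.70, "root.DW.Dev.P1": 0.30,
}

def absolute_capacity(path):
    """root.DW.Dev.P0 -> 0.60 * 0.20 * 0.70 of the whole cluster."""
    parts = path.split(".")
    frac = 1.0
    for i in range(2, len(parts) + 1):
        frac *= tree.get(".".join(parts[:i]), 1.0)
    return frac

print(round(absolute_capacity("root.DW.Prod"), 4))    # 0.48
print(round(absolute_capacity("root.DW.Dev.P0"), 4))  # 0.084
```

So the DW production queue is guaranteed 48% of the cluster, while a deep leaf like P0 still has a precise, predictable share.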

Page 10:

Multi-Tenancy with the New Capacity Scheduler

•  Queues
  – Hierarchical queues
  – Economics as queue capacity
•  SLAs
  – Preemption
•  Resource isolation
  – Linux: cgroups
  – MS Windows: Job Control
  – Roadmap: virtualization (Xen, KVM)
•  Administration
  – Queue ACLs
  – Run-time re-configuration for queues
  – Charge-back

Capacity Scheduler hierarchical queues (example):

root
  Adhoc 10%
    Dev 10%
    Reserved 20%
    Prod 70%
  DW 70%
    Prod 80%
    Dev 20%
      P0 70%
      P1 30%
  Mrkting 20%

Page 11:

MapReduce v2: Changes to MapReduce on YARN

Page 12:

MapReduce v2 is a library now…
•  MapReduce runs on YARN like all other Hadoop 2.x applications
  – Gone are the map and reduce slots; allocation is handled by YARN containers now
  – Gone is the JobTracker, replaced by the YARN ApplicationMaster library
•  Multiple versions of MapReduce
  – The older mapred APIs work without modification or recompilation
  – The newer mapreduce APIs may need to be recompiled
•  Still has one master server component: the Job History Server
  – The Job History Server stores the execution history of jobs
  – Used to audit prior execution of jobs
  – Will also be used by the YARN framework to store charge-backs at that level

Page 13:

Shuffle in MapReduce v2
•  Faster shuffle
  – Better embedded server: Netty
•  Encrypted shuffle
  – Secures the shuffle phase as data moves across the cluster
  – Requires two-way HTTPS with certificates on both sides
  – Incurs significant CPU overhead; reserve one core for this work
  – Certs are stored on each node (provisioned with the cluster) and refreshed every 10 seconds
•  Pluggable shuffle/sort
  – Shuffle is the first phase in MapReduce that is guaranteed not to be data-local
  – Pluggable shuffle/sort allows intrepid application or hardware developers to intercept the network-heavy workload and optimize it
  – Typical implementations pair hardware components, like fast networks, with software components, like sorting algorithms
  – The API will change with future versions of Hadoop
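The pluggable and encrypted shuffle are wired up through configuration. A sketch of the relevant mapred-site.xml properties, using the Hadoop 2.x property names from the pluggable-shuffle documentation (the class values shown are the stock defaults; per the caveat above, treat exact names as subject to change between versions):

```xml
<!-- mapred-site.xml: pluggable shuffle/sort (default implementations shown) -->
<property>
  <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
  <value>org.apache.hadoop.mapreduce.task.reduce.Shuffle</value>
</property>
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>org.apache.hadoop.mapred.MapTask$MapOutputBuffer</value>
</property>

<!-- encrypted shuffle: HTTPS between map output servers and reducers -->
<property>
  <name>mapreduce.shuffle.ssl.enabled</name>
  <value>true</value>
</property>
```

Swapping either plugin class is how a custom (e.g. hardware-accelerated) shuffle or sort implementation gets dropped in without changing job code.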

Page 14:

Efficiency Gains of MRv2

• Key optimizations
  – No hard segmentation of resources into map and reduce slots
  – The YARN scheduler is more efficient
  – The MRv2 framework is more efficient than MRv1; the shuffle phase in MRv2 is more performant thanks to Netty
• Yahoo has over 30,000 nodes running YARN across over 365 PB of data.
• They calculate running about 400,000 jobs per day for about 10 million hours of compute time.
• They have also estimated a 60% to 150% improvement in node usage per day.
• Yahoo retired a whole colo (a 10,000-node datacenter) because of the increased utilization.

Page 15:

HDFS v2 in a Nutshell

Page 16:

HA

Page 17:

HDFS Snapshots: Feature Overview

•  Admins can create point-in-time snapshots of HDFS
  – Of the entire file system (from the root)
  – Of a specific data set (a sub-tree of the file system)
•  Restore the state of the entire file system or a data set to a snapshot (like Apple Time Machine)
  – Protects against user errors
•  Snapshot diffs identify changes made to a data set
  – Keep track of how raw or derived/analytical data changes over time
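Restores work because each snapshot is exposed read-only under a ".snapshot" directory inside the snapshotted directory, so recovering a file is just a copy from that path. A sketch; the paths, snapshot names, and the shell commands in the comments are illustrative:

```python
# Illustrative: a file's contents as of a named snapshot live under the
# directory's read-only ".snapshot" subtree.
#
# Typical admin flow (shell, for reference; requires a cluster):
#   hdfs dfsadmin -allowSnapshot /data/weblogs
#   hdfs dfs -createSnapshot /data/weblogs snap-2013-11-20
#   hdfs snapshotDiff /data/weblogs snap-2013-11-19 snap-2013-11-20

def snapshot_path(directory, snapshot, relative_file):
    """Path of a file as captured in a named snapshot of `directory`."""
    return f"{directory.rstrip('/')}/.snapshot/{snapshot}/{relative_file}"

print(snapshot_path("/data/weblogs", "snap-2013-11-20", "part-00000"))
# /data/weblogs/.snapshot/snap-2013-11-20/part-00000
```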

Page 18:

NFS Gateway: Feature Overview

•  NFS v3 standard
•  Supports all HDFS commands
  – List files
  – Copy, move files
  – Create and delete directories
•  Ingest for large-scale analytical workloads
  – Load immutable files as a source for analytical processing
  – No random writes
•  Stream files into HDFS
  – Log ingest by applications writing directly to an HDFS client mount

Page 19:

Federation

Page 20:

Managing Namespaces

Page 21:

Performance

Page 22:

Other Features

Page 23:

Apache Tez: A New Hadoop Data Processing Framework

Page 24:

Moving Hadoop Beyond MapReduce

•  Low-level data-processing execution engine
•  Built on YARN
•  Enables pipelining of jobs
•  Removes task and job launch times
•  Does not write intermediate output to HDFS
  – Much lighter disk and network usage
•  New base for MapReduce, Hive, Pig, Cascading, etc.
•  Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
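The no-intermediate-HDFS point can be made concrete with a toy model (illustrative only, not the Tez API): a pipeline of N dependent stages costs N HDFS materializations as chained MapReduce jobs, but only one as a single DAG.

```python
# Toy cost model (not Tez code): count HDFS writes for an N-stage pipeline.

def hdfs_writes_as_mr_chain(num_stages):
    """Each chained MR job writes its full result to HDFS; the next re-reads it."""
    return num_stages

def hdfs_writes_as_tez_dag(num_stages):
    """In one DAG, intermediate data flows edge-to-edge; only the final
    vertex lands its output in HDFS."""
    return 1 if num_stages else 0

stages = ["join(a,b)", "group by", "join(a,c)", "order by"]  # hypothetical plan
print(hdfs_writes_as_mr_chain(len(stages)))  # 4
print(hdfs_writes_as_tez_dag(len(stages)))   # 1
```

The saved writes (and the matching re-reads) are where the "much lighter disk and network usage" comes from.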

Page 25:

Apache Tez as the new Primitive

HADOOP 1.0 (MapReduce as base):
– HDFS (redundant, reliable storage)
– MapReduce (cluster resource management & data processing)
– Pig (data flow), Hive (SQL), others (Cascading)

HADOOP 2.0 (Apache Tez as base):
– HDFS2 (redundant, reliable storage)
– YARN (cluster resource management)
– Tez (execution engine), with Pig (data flow), Hive (SQL) and others (Cascading) on top
– Alongside: batch MapReduce, real-time stream processing (Storm), online data processing (HBase, Accumulo)

Page 26:

Hive-on-MR vs. Hive-on-Tez

SELECT a.x, AVERAGE(b.y) AS avg
FROM a JOIN b ON (a.id = b.id)
GROUP BY a
UNION
SELECT x, AVERAGE(y) AS AVG
FROM c GROUP BY x
ORDER BY AVG;

(Diagram: Hive-on-MR runs this as a chain of MapReduce jobs: JOIN(a, b), SELECT b.id, GROUP BY a.state with COUNT(*) and AVERAGE(c.price), JOIN(a, c), SELECT c.price, SELECT a.state, writing to HDFS between every job. Hive-on-Tez runs the same plan as a single DAG with no intermediate HDFS writes. Tez avoids unneeded writes to HDFS.)

Page 27:

Apache Tez (“Speed”)
• Replaces MapReduce as the primitive for Pig, Hive, Cascading, etc.
  – Lower latency for interactive queries
  – Higher throughput for batch queries
  – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
• A YARN ApplicationMaster runs a DAG of Tez tasks
• Each task has a pluggable Input, Processor and Output
• Tez task = <Input, Processor, Output>

Page 28:

Tez: Building blocks for scalable data processing

Classical ‘Map’:  HDFS Input → Map Processor → Sorted Output
Classical ‘Reduce’:  Shuffle Input → Reduce Processor → HDFS Output
Intermediate ‘Reduce’ (for Map-Reduce-Reduce):  Shuffle Input → Reduce Processor → Sorted Output

Page 29:

Hive

Page 30:

SQL: Enhancing SQL Semantics

Hive SQL Datatypes            Hive SQL Semantics
INT                           SELECT, INSERT
TINYINT/SMALLINT/BIGINT       GROUP BY, ORDER BY, SORT BY
BOOLEAN                       JOIN on explicit join key
FLOAT                         Inner, outer, cross and semi joins
DOUBLE                        Sub-queries in FROM clause
STRING                        ROLLUP and CUBE
TIMESTAMP                     UNION
BINARY                        Windowing functions (OVER, RANK, etc.)
DECIMAL                       Custom Java UDFs
ARRAY, MAP, STRUCT, UNION     Standard aggregation (SUM, AVG, etc.)
DATE                          Advanced UDFs (ngram, XPath, URL)
VARCHAR                       Sub-queries in WHERE, HAVING
CHAR                          Expanded JOIN syntax
                              SQL-compliant security (GRANT, etc.)
                              INSERT/UPDATE/DELETE (ACID)

(Legend on slide: items shaded as available in Hive 0.12 vs. roadmap.)

SQL Compliance: Hive 0.12 provides a wide array of SQL datatypes and semantics, so your existing tools integrate more seamlessly with Hadoop.

Page 31:

SPEED: Increasing Hive Performance

Performance improvements included in Hive 0.12:
–  Base and advanced query optimization
–  Startup time improvement
–  Join optimizations

Interactive query times across ALL use cases:
•  Simple and advanced queries in seconds
•  Integrates seamlessly with existing tools
•  Currently a >100x improvement in just nine months

Page 32:

Apache Tez as the new Primitive

(Repeat of the stack comparison from Page 25: HADOOP 1.0 with MapReduce as base vs. HADOOP 2.0 with YARN and Tez as base.)

Page 33:

Hive-on-MR vs. Hive-on-Tez

(Repeat of the comparison from Page 26: the same UNION query as chained MapReduce jobs with HDFS writes between each stage, vs. a single Tez DAG. Tez avoids unneeded writes to HDFS.)

Page 34:

Tez on YARN

(Diagram: the NodeManager grid again, with Batch MapReduce tasks (map 1.1, map 1.2, reduce 1.1), Hive/Tez (SQL) vertices (vertex 1.1.1, 1.1.2, 1.2.1, 1.2.2), and Real-Time containers (nimbus0, nimbus1, nimbus2) scheduled side by side by the ResourceManager/Scheduler.)

Page 35:

Apache Falcon: Data Lifecycle Management for Hadoop

Page 36:

Data Lifecycle on Hadoop is Challenging

Data Management Needs       Tools
Data Processing             Oozie
Replication                 Sqoop
Retention                   DistCp
Scheduling                  Flume
Reprocessing                Map/Reduce
Multi-Cluster Management    Hive and Pig jobs

Problem: a patchwork of tools complicates data lifecycle management.
Result: long development cycles and quality challenges.

Page 37:

Falcon: One-stop Shop for Data Lifecycle

Apache Falcon provides and orchestrates:

Data Management Needs       Tools
Data Processing             Oozie
Replication                 Sqoop
Retention                   DistCp
Scheduling                  Flume
Reprocessing                Map/Reduce
Multi-Cluster Management    Hive and Pig jobs

Falcon provides a single interface to orchestrate the data lifecycle. Sophisticated DLM is easily added to Hadoop applications.

Page 38:

Falcon Core Capabilities
•  Core functionality
  – Pipeline processing
  – Replication
  – Retention
  – Late data handling
•  Automates
  – Scheduling and retry
  – Recording audit, lineage and metrics
•  Operations and management
  – Monitoring, management, metering
  – Alerts and notifications
  – Multi-cluster federation
•  CLI and REST API

Page 39:

Falcon At A Glance

>  Falcon offers a high-level abstraction of key services for Hadoop data management needs.
>  Complex data processing logic is handled by Falcon instead of hard-coded in data processing apps.
>  Falcon enables faster development of ETL, reporting and other data processing apps on Hadoop.

(Diagram: data processing applications sit on top of the Falcon data management framework, which provides data import and replication, scheduling and coordination, data lifecycle policies, multi-cluster management, and SLA management.)

Page 40:

Falcon Example: Replication

>  Falcon manages workflow and replication.
>  Enables business continuity without requiring full data representation.
>  Failover clusters can be smaller than primary clusters.

(Diagram: the primary cluster's staged, cleansed, conformed and access data sets, with the staged and processed data replicated to a smaller failover cluster.)

Page 41:

Falcon Example: Retention

>  Sophisticated retention policies expressed in one place.
>  Simplify data retention for audit, compliance, or for data re-processing.

Staged Data: retain 20 years
Cleansed Data: retain 3 years
Conformed Data: retain 3 years
Access Data: retain last copy only
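A retention policy like the ones above boils down to an eviction rule over dataset instances. A sketch, illustrative only and not Falcon code:

```python
# Illustrative Falcon-style retention: given dataset instances (one
# timestamp each), decide which instances to evict.
from datetime import datetime, timedelta

def to_evict(instances, now, retain_days=None, last_copy_only=False):
    if last_copy_only:
        return sorted(instances)[:-1]            # keep only the newest copy
    cutoff = now - timedelta(days=retain_days)
    return [ts for ts in instances if ts < cutoff]

now = datetime(2013, 11, 20, 12)
instances = [now - timedelta(hours=h) for h in (0, 24, 48)]
print(to_evict(instances, now, retain_days=1))        # only the 48h-old instance
print(to_evict(instances, now, last_copy_only=True))  # everything but the newest
```

Expressing this once per feed, instead of inside every application, is the point of the slide.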

Page 42:

Falcon Example: Late Data Handling

>  Processing waits until all required input data is available.
>  Checks for late data arrivals and retriggers processing as necessary.
>  Eliminates writing complex data handling rules within applications.

(Diagram: online transaction data (via Sqoop) and web log data (via FTP) are staged and combined into one dataset; the pipeline waits up to 4 hours for the FTP data to arrive.)
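The wait-then-retrigger behavior can be sketched as follows; this is an illustrative model, not Falcon code, and the feed names are made up:

```python
# Illustrative Falcon-style late-data handling: run when every required
# input is present, or once the wait budget (e.g. 4 hours) is spent.

def ready_to_run(available, required, waited_hours, max_wait_hours=4):
    missing = [f for f in required if f not in available]
    if not missing:
        return True, []
    if waited_hours >= max_wait_hours:
        return True, missing   # run anyway; late arrivals retrigger reprocessing
    return False, missing

required = ("sqoop:transactions", "ftp:weblogs")
print(ready_to_run({"sqoop:transactions"}, required, waited_hours=2))
print(ready_to_run({"sqoop:transactions", "ftp:weblogs"}, required, waited_hours=2))
```

Centralizing this check is what lets applications drop their own ad-hoc "is the FTP drop here yet?" logic.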

Page 43:

Examples

Page 44:

Example: Cluster Specification

<?xml version="1.0"?>
<!-- My Local Cluster specification -->
<cluster colo="my-local-cluster" description="" name="cluster-alpha">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0"/>
    <interface type="execute" endpoint="rm:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0"/>
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/cluster-alpha/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/cluster-alpha/working"/>
  </locations>
</cluster>

(Diagram: the readonly and write interfaces point at the NameNode, the execute interface at the Resource Manager, and the workflow interface at the Oozie server.)

Page 45:

Example: Weblogs Replication and Retention

Page 46:

Example 1: Weblogs
•  Weblogs land hourly in my primary cluster
•  The HDFS location is /weblogs/{date}
•  I want to:
  – Evict weblogs from the primary cluster after 1 day

Page 47:

Feed Specification 1: Weblogs

<feed description="" name="feed-weblogs1" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>

  <clusters>
    <!-- cluster where the data is located -->
    <cluster name="cluster-primary" type="source">
      <validity start="2013-10-24T00:00Z" end="2014-12-31T00:00Z"/>
      <!-- retention policy: 1 day -->
      <retention limit="days(1)" action="delete"/>
    </cluster>
  </clusters>

  <!-- location of the data -->
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>

  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
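The limit strings used in these feeds, such as hours(1), days(1) and months(3), follow a simple unit(n) pattern. A sketch of parsing them; the 30-day month is this sketch's approximation, not Falcon's own rule:

```python
# Illustrative parser for Falcon-style frequency/retention limits like
# "hours(1)", "days(2)", "months(3)". Months are approximated as 30 days
# here purely for the sketch.
import re
from datetime import timedelta

UNITS_IN_MINUTES = {"minutes": 1, "hours": 60, "days": 24 * 60, "months": 30 * 24 * 60}

def parse_limit(text):
    m = re.fullmatch(r"(minutes|hours|days|months)\((\d+)\)", text.strip())
    if not m:
        raise ValueError(f"bad limit: {text!r}")
    unit, n = m.group(1), int(m.group(2))
    return timedelta(minutes=UNITS_IN_MINUTES[unit] * n)

print(parse_limit("days(1)"))    # 1 day, 0:00:00
print(parse_limit("months(3)"))  # 90 days under the 30-day approximation
```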

Page 48:

Example 2: Weblogs
•  Weblogs land hourly in my primary cluster
•  The HDFS location is /weblogs/{date}
•  I want to:
  – Replicate weblogs to my secondary cluster
  – Evict weblogs from the primary cluster after 2 days
  – Evict weblogs from the secondary cluster after 1 week

Page 49:

Feed Specification 2: Weblogs

<feed description="" name="feed-weblogs2" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>

  <clusters>
    <!-- cluster where the data is located; retention policy: 2 days -->
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <!-- cluster where the data will be replicated; retention policy: 1 week -->
    <cluster name="cluster-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>

  <!-- location of the data -->
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>

  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

Page 50:

Example 3: Weblogs
•  Weblogs land hourly in my primary cluster
•  The HDFS location is /weblogs/{date}
•  I want to:
  – Replicate weblogs to a discovery cluster
  – Replicate weblogs to a BCP cluster
  – Evict weblogs from the primary cluster after 2 days
  – Evict weblogs from the discovery cluster after 1 week
  – Evict weblogs from the BCP cluster after 3 months

Page 51:

Feed Specification 3: Weblogs

<feed description="" name="feed-weblogs" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>

  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-discovery" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
      <!-- cluster-specific location -->
      <locations>
        <location type="data" path="/projects/recommendations/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
    <cluster name="cluster-bcp" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="months(3)" action="delete"/>
      <!-- cluster-specific location -->
      <locations>
        <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
  </clusters>

  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>

  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

Page 52:

Apache Knox: Secure Access to Hadoop

Page 53:

Connecting to the Cluster: Edge Nodes

•  What is an edge node?
  – A node in a DMZ that has access to the cluster; often the only way to reach it
  – Hadoop client APIs and MR/Pig/Hive jobs are executed from these edge nodes
  – Users SSH to the edge node, upload all job artifacts, and then execute API calls and commands from the shell

(Diagram: Hadoop user → SSH → edge node → cluster.)

• Challenges
  – SSH, edge node, and job maintenance nightmare
  – Difficult to integrate with applications

Page 54:

Connecting to the Cluster: REST API

•  Useful for connecting to Hadoop from outside the cluster
•  When more client language flexibility is required
  –  i.e. a Java binding is not an option

• Challenges
  – The client must have knowledge of the cluster topology
  – Requires opening ports outside the cluster (in some cases, on every host)

Service    API
WebHDFS    Supports HDFS user operations including reading files, writing to files, making directories, changing permissions and renaming.
WebHCat    Job control for MapReduce, Pig and Hive jobs, and HCatalog DDL commands.
Oozie      Job submission and management, and Oozie administration.
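WebHDFS calls are plain HTTP against the NameNode; the URL pattern /webhdfs/v1/<path>?op=<OP> is the WebHDFS REST convention, while the host, port and user below are illustrative:

```python
# Sketch of WebHDFS request URL construction (no cluster required to see
# the shape). 50070 is the usual Hadoop 2.x NameNode HTTP port.
from urllib.parse import urlencode

def webhdfs_url(host, path, op, port=50070, params=None):
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

print(webhdfs_url("nn", "/weblogs", "LISTSTATUS", params={"user.name": "hdfs"}))
# http://nn:50070/webhdfs/v1/weblogs?op=LISTSTATUS&user.name=hdfs
```

Note how the client needs the NameNode's host and port, which is exactly the topology-knowledge problem listed above and the gap Knox fills.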

Page 55:

Apache Knox Gateway – Perimeter Security

Simplified Access
•  Single Hadoop access point
•  Rationalized REST API hierarchy
•  Consolidated API calls
•  Multi-cluster support
•  Client DSL

Centralized Security
•  Eliminates the SSH “edge node”
•  LDAP and Active Directory authentication
•  Central API management and audit
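The single-access-point idea can be sketched as a path-based routing table: clients hit one gateway URL and the gateway maps each service onto the right host and port inside the cluster. The topology name, service table and ports here are illustrative, not Knox's actual configuration:

```python
# Illustrative Knox-style routing (not Knox code): one external entry
# point, internal topology hidden behind it.

SERVICES = {  # service name -> (internal host, port, base path); assumed values
    "webhdfs": ("nn", 50070, "/webhdfs"),
    "templeton": ("webhcat", 50111, "/templeton"),
    "oozie": ("oozie", 11000, "/oozie"),
}

def route(gateway_path):
    """Map /gateway/<topology>/<service>/<rest> to the internal cluster URL."""
    _, _, topology, service, rest = gateway_path.split("/", 4)
    host, port, base = SERVICES[service]
    return f"http://{host}:{port}{base}/{rest}"

print(route("/gateway/cluster1/webhdfs/v1/weblogs?op=LISTSTATUS"))
# http://nn:50070/webhdfs/v1/weblogs?op=LISTSTATUS
```

The client only ever learns the gateway address; host names, ports and cluster layout stay behind the firewall, and responses are rewritten so follow-up URLs also point at the gateway.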

Page 56:

Knox Gateway Network Architecture

(Diagram: REST, JDBC and Ambari clients and browsers connect through a firewall to a Knox Gateway cluster, a stateless cluster of reverse-proxy instances (GW) deployed in a DMZ. Requests are streamed through the gateway to Hadoop services after authentication, and URLs are rewritten to refer to the gateway. Identity providers (Kerberos/enterprise identity, enterprise/cloud SSO) and Ambari/Hue servers integrate alongside. Behind a second firewall sit one or more secure Hadoop clusters, each with masters (JT, NN, WebHCat, Oozie, YARN, HBase, Hive) and workers (DN, TT).)

Page 57:

Wot no 2.2.0? Where can I get the Hadoop 2.2.0 fix?

Page 58:

Like the Truth, Hadoop 2.2.0 is out there…

Component        HDP 2.0   CDH4         CDH5 Beta   Intel IDH 3.0   MapR 3   IBM BigInsights 2.1
Hadoop Common    2.2.0     2.0.0        2.2.0       2.0.4           N/A      1.1.1
Hive + HCatalog  0.12      0.10 + 0.5   0.11        0.10 + 0.5      0.11     0.9 + 0.4
Pig              0.12      0.11         0.11        0.10            0.11     0.10
Mahout           0.8       0.7          0.8         0.8             0.8      N/A
Flume            1.4.0     1.4.0        1.4.0       1.3.0           1.4.0    1.3.0
Oozie            4.0.0     3.3.2        4.0.0       3.3.0           3.3.2    3.2.0
Sqoop            1.4.4     1.4.3        1.4.4       1.4.3           1.4.4    1.4.2
HBase            0.96.0    0.94.6       0.95.2      0.94.7          0.94.9   0.94.3

Page 59:

Thank You THUG Life