2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0


DESCRIPTION

Our Hadoop 2.2.0 Overview for the Toronto Hadoop User Group. Go THUG life.

TRANSCRIPT


Hadoop 2.2.0: Hadoop grows up

Page 1

Adam Muise


Rob Ford says…

…turn off your #*@!#%!!! Mobile Phones!

Page 2


YARN: Yet Another Resource Negotiator


A new abstraction layer

HADOOP 1.0: MapReduce (cluster resource management & data processing) running directly on HDFS (redundant, reliable storage). A single-use system: batch apps.

HADOOP 2.0: YARN (cluster resource management) on HDFS2 (redundant, reliable storage), with MapReduce and other frameworks (data processing) on top. A multi-purpose platform: batch, interactive, online, streaming, …

Page 4


Concepts

• Application
  – An application is a job submitted to the framework
  – Example: a MapReduce job
• Container
  – Basic unit of allocation
  – Fine-grained resource allocation across multiple resource types (memory, CPU, disk, network, GPU, etc.), e.g.:
    – container_0 = 2 GB, 1 CPU
    – container_1 = 1 GB, 6 CPU
  – Replaces the fixed map/reduce slots (see the request sketch below)
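
A minimal sketch of such a fine-grained ask against the YARN client API (assuming the Hadoop 2.2.0 AMRMClient; the 2 GB / 1 vcore values mirror container_0 above, and ApplicationMaster registration/heartbeat plumbing is omitted):

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerAsk {
  public static void main(String[] args) {
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(new YarnConfiguration());
    rm.start();
    // container_0 from the slide: 2 GB of memory and 1 virtual core
    Resource capability = Resource.newInstance(2048, 1);
    // nodes/racks left null: no locality constraint on this ask
    rm.addContainerRequest(new ContainerRequest(capability, null, null, Priority.newInstance(0)));
    // (registerApplicationMaster and the allocate() heartbeat are omitted in this sketch)
  }
}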

Page 5


YARN Architecture

• ResourceManager
  – Global resource scheduler
  – Hierarchical queues
• NodeManager
  – Per-machine agent
  – Manages the life-cycle of containers
  – Container resource monitoring
• ApplicationMaster
  – Per-application
  – Manages application scheduling and task execution
  – E.g., the MapReduce ApplicationMaster

Page 6


YARN Architecture - Walkthrough

[Diagram: a client (Client2) submits an application to the ResourceManager's Scheduler; a per-application ApplicationMaster (AM 1, AM 2) is launched on a NodeManager, and each AM is then granted containers (Container 1.1-1.3, Container 2.1-2.4) spread across the fleet of NodeManagers.]


YARN as OS for the Data Lake

[Diagram: one ResourceManager/Scheduler drives mixed workloads across the same NodeManagers: Batch MapReduce (map 1.1, map 1.2, reduce 1.1), Interactive SQL (vertex 1.1.1, 1.1.2, 1.2.1, 1.2.2), and Real-Time Storm (nimbus0, nimbus1, nimbus2), side by side.]


Multi-Tenant YARN

[Diagram: the ResourceManager Scheduler splits cluster capacity over a queue hierarchy rooted at root (Adhoc 10%, DW 60%, Mrkting 30%), with nested sub-queues such as Dev 10% / Reserved 20% / Prod 70%, Prod 80% / Dev 20%, and P0 70% / P1 30%.]


Multi-Tenancy with New Capacity Scheduler

•  Queues
  – Hierarchical queues
•  Economics as queue-capacity
  – SLAs
  – Preemption
•  Resource isolation
  – Linux: cgroups
  – MS Windows: Job Control
  – Roadmap: virtualization (Xen, KVM)
•  Administration
  – Queue ACLs
  – Run-time re-configuration for queues
  – Charge-back

Page 10

[Diagram: the ResourceManager Scheduler's queue tree again, here with root split into Adhoc 10%, DW 70% and Mrkting 20%, plus nested sub-queues (Dev 10% / Reserved 20% / Prod 70%, Prod 80% / Dev 20%, P0 70% / P1 30%).]

Capacity Scheduler: hierarchical queues (a configuration sketch follows below).
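
A sketch of how such a hierarchy is expressed; the queue names mirror the diagram, and in a real cluster these yarn.scheduler.capacity.* properties live in capacity-scheduler.xml rather than being set in code:

import org.apache.hadoop.conf.Configuration;

public class QueueCapacities {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Children of root, per the diagram: Adhoc 10%, DW 70%, Mrkting 20%
    conf.set("yarn.scheduler.capacity.root.queues", "adhoc,dw,mrkting");
    conf.set("yarn.scheduler.capacity.root.adhoc.capacity", "10");
    conf.set("yarn.scheduler.capacity.root.dw.capacity", "70");
    conf.set("yarn.scheduler.capacity.root.mrkting.capacity", "20");
    // DW splits again: Dev 10%, Reserved 20%, Prod 70%
    conf.set("yarn.scheduler.capacity.root.dw.queues", "dev,reserved,prod");
    conf.set("yarn.scheduler.capacity.root.dw.dev.capacity", "10");
    conf.set("yarn.scheduler.capacity.root.dw.reserved.capacity", "20");
    conf.set("yarn.scheduler.capacity.root.dw.prod.capacity", "70");
    System.out.println("root children: " + conf.get("yarn.scheduler.capacity.root.queues"));
  }
}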


MapReduce v2: Changes to MapReduce on YARN


MapReduce v2 is a library now
•  MapReduce runs on YARN like all other Hadoop 2.x applications
  – Gone are the map and reduce slots; that is up to containers in YARN now
  – Gone is the JobTracker, replaced by the YARN ApplicationMaster library
•  Multiple versions of MapReduce
  – The older mapred APIs work without modification or recompilation
  – The newer mapreduce APIs may need to be recompiled
•  Still has one master server component: the Job History Server
  – The Job History Server stores the execution history of jobs
  – Used to audit prior execution of jobs
  – Will also be used by the YARN framework to store charge-backs at that level
A minimal driver sketch follows below.
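
For illustration, a minimal driver against the newer mapreduce API; the same code that targeted an MRv1 JobTracker runs unchanged under a YARN ApplicationMaster (the input/output paths and map-only job are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MiniDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "mrv2-demo");
    job.setJarByClass(MiniDriver.class);
    job.setNumReduceTasks(0); // map-only: identity mappers by default
    FileInputFormat.addInputPath(job, new Path("/weblogs/in"));    // hypothetical
    FileOutputFormat.setOutputPath(job, new Path("/weblogs/out")); // hypothetical
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}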

Page 12


Shuffle in MapReduce v2
•  Faster shuffle
  – Better embedded server: Netty
•  Encrypted shuffle
  – Secures the shuffle phase as data moves across the cluster
  – Requires 2-way HTTPS, with certificates on both sides
  – Incurs significant CPU overhead; reserve 1 core for this work
  – Certs are stored on each node (provisioned with the cluster) and refreshed every 10 secs
•  Pluggable shuffle/sort
  – Shuffle is the first phase in MapReduce that is guaranteed not to be data-local
  – Pluggable shuffle/sort lets intrepid application or hardware developers intercept the network-heavy workload and optimize it
  – Typical implementations pair hardware components (like fast networks) with software components (like sorting algorithms)
  – The API will change with future versions of Hadoop
A configuration sketch for the encrypted shuffle follows below.
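
A hedged sketch of enabling the encrypted shuffle described above (property names per the Hadoop 2.x encrypted-shuffle documentation; the keystore/truststore material in ssl-server.xml and ssl-client.xml is assumed and not shown):

import org.apache.hadoop.conf.Configuration;

public class EncryptedShuffleConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.shuffle.ssl.enabled", true);  // reducers fetch map output over HTTPS
    conf.setBoolean("hadoop.ssl.require.client.cert", true); // 2-way HTTPS: certs on both sides
    System.out.println("encrypted shuffle: " + conf.get("mapreduce.shuffle.ssl.enabled"));
  }
}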

Page 13


Efficiency Gains of MRv2

• Key optimizations
  – No hard segmentation of resources into map and reduce slots
  – The YARN scheduler is more efficient
  – The MRv2 framework is more efficient than MRv1; the shuffle phase in MRv2 is faster thanks to Netty

• Yahoo has over 30,000 nodes running YARN across over 365 PB of data.

• They calculate running about 400,000 jobs per day for about 10 million hours of compute time.

• They also estimate a 60-150% improvement in node usage per day.

• Yahoo retired a whole colo (a 10,000-node datacenter) because of the increased utilization.


HDFS v2 in a Nutshell


HA

Page 16


HDFS Snapshots: Feature Overview

•  Admins can create point-in-time snapshots of HDFS
  – Of the entire file system (/root)
  – Of a specific data set (a sub-tree directory of the file system)
•  Restore the state of the entire file system or of a data set to a snapshot (like Apple Time Machine)
  – Protects against user errors
•  Snapshot diffs identify changes made to a data set
  – Keep track of how raw or derived/analytical data changes over time
A client API sketch follows below.
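
A small sketch of the client-side snapshot call (the /data/raw directory and snapshot name are hypothetical, and an admin must first mark the directory snapshottable with hdfs dfsadmin -allowSnapshot):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/data/raw");        // hypothetical snapshottable directory
    fs.createSnapshot(dir, "before-reload"); // point-in-time, read-only view
    // The snapshot is then visible under /data/raw/.snapshot/before-reload
    // and can be used to restore files or to diff against later state.
  }
}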

Page 17


NFS Gateway: Feature Overview

•  NFS v3 standard
•  Supports all HDFS commands
  – List files
  – Copy, move files
  – Create and delete directories
•  Ingest for large-scale analytical workloads
  – Load immutable files as source for analytical processing
  – No random writes
•  Stream files into HDFS
  – Log ingest by applications writing directly to the HDFS client mount (a toy ingest sketch follows below)
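
As a toy illustration, once the gateway export is mounted, ordinary file APIs stream data into HDFS (the /hdfs mount point and log line are hypothetical):

import java.io.FileWriter;
import java.io.PrintWriter;

public class NfsIngest {
  public static void main(String[] args) throws Exception {
    // Append-only writes through the NFS mount land in HDFS (no random writes)
    try (PrintWriter log = new PrintWriter(new FileWriter("/hdfs/weblogs/app.log", true))) {
      log.println("2013-11-20T19:00:00Z GET /index.html 200");
    }
  }
}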


Federation

Page 19


Managing Namespaces

Page 20


Performance

Page 21


Other Features

Page 22


Apache Tez: A New Hadoop Data Processing Framework

Page 23


Moving Hadoop Beyond MapReduce
•  Low-level data-processing execution engine
•  Built on YARN
•  Enables pipelining of jobs
•  Removes task and job launch times
•  Does not write intermediate output to HDFS
  – Much lighter disk and network usage
•  New base for MapReduce, Hive, Pig, Cascading, etc.
•  Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline


Apache Tez as the new Primitive

HADOOP 1.0: Pig (data flow), Hive (SQL) and others (Cascading) run on MapReduce (cluster resource management & data processing) over HDFS (redundant, reliable storage). MapReduce is the base.

HADOOP 2.0: Pig (data flow), Hive (SQL) and others (Cascading) run on Tez (execution engine) over YARN (cluster resource management) and HDFS2 (redundant, reliable storage), alongside batch MapReduce, real-time stream processing (Storm) and online data processing (HBase, Accumulo). Apache Tez is the base.


Hive-on-MR vs. Hive-on-Tez

SELECT a.x, AVERAGE(b.y) AS avg FROM a JOIN b ON (a.id = b.id) GROUP BY a
UNION SELECT x, AVERAGE(y) AS AVG FROM c GROUP BY x
ORDER BY AVG;

[Diagram: Hive-on-MR runs the plan as a chain of separate MapReduce jobs (SELECT a.state / SELECT b.id → JOIN(a, b), GROUP BY a.state, COUNT(*) → SELECT c.price → JOIN(a, c) → AVERAGE(c.price)), each job writing its intermediate output to HDFS for the next to read back. Hive-on-Tez runs the same plan as a single DAG (SELECT a.state, c.itemId / SELECT b.id → JOIN(a, b), GROUP BY a.state, COUNT(*) → JOIN(a, c), AVERAGE(c.price)): Tez avoids unneeded writes to HDFS.]


Apache Tez (“Speed”)
• Replaces MapReduce as the primitive for Pig, Hive, Cascading, etc.
  – Lower latency for interactive queries
  – Higher throughput for batch queries
  – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
• A YARN ApplicationMaster runs a DAG of Tez tasks
• Each task has a pluggable Input, Processor and Output: Tez Task = <Input, Processor, Output>
A toy model of the triple follows below.
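
To make the triple concrete, here is a toy model of Task = <Input, Processor, Output>; these interfaces are illustrative only, not Tez's actual classes:

import java.util.Arrays;
import java.util.List;

interface Input<T> { Iterable<T> read(); }
interface Output<T> { void write(T record); }
interface Processor<I, O> { void run(Input<I> in, Output<O> out); }

public class TezTriple {
  public static void main(String[] args) {
    List<String> lines = Arrays.asList("hello", "tez");
    Input<String> in = () -> lines;                  // stands in for an HDFS or Shuffle input
    Output<String> out = r -> System.out.println(r); // stands in for a Sorted or HDFS output
    Processor<String, String> upper = (i, o) -> {    // the pluggable processing logic
      for (String s : i.read()) o.write(s.toUpperCase());
    };
    upper.run(in, out);
  }
}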


Tez: Building blocks for scalable data processing

Classical ‘Map’: a Map Processor with an HDFS Input and a Sorted Output.

Classical ‘Reduce’: a Reduce Processor with a Shuffle Input and an HDFS Output.

Intermediate ‘Reduce’ for Map-Reduce-Reduce: a Reduce Processor with a Shuffle Input and a Sorted Output.


Hive

Page 29


SQL: Enhancing SQL Semantics

Hive SQL datatypes: INT, TINYINT, SMALLINT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, BINARY, DECIMAL, ARRAY, MAP, STRUCT, UNION, DATE, VARCHAR, CHAR.

Hive SQL semantics: SELECT, INSERT; GROUP BY, ORDER BY, SORT BY; JOIN on explicit join key; inner, outer, cross and semi joins; sub-queries in the FROM clause; ROLLUP and CUBE; UNION; windowing functions (OVER, RANK, etc.); custom Java UDFs; standard aggregation (SUM, AVG, etc.); advanced UDFs (ngram, XPath, URL); sub-queries in WHERE and HAVING; expanded JOIN syntax; SQL-compliant security (GRANT, etc.); INSERT/UPDATE/DELETE (ACID).

(Slide legend: each item is marked either Available in Hive 0.12 or Roadmap.)

SQL Compliance: Hive 0.12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop. A JDBC sketch follows below.
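
Since these semantics surface through HiveServer2, a hedged JDBC sketch (the host, port, credentials and sales table are hypothetical; RANK() OVER is one of the windowing functions listed above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbc {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://hs2:10000/default", "hdfs", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT state, price, RANK() OVER (PARTITION BY state ORDER BY price DESC) AS r FROM sales")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getDouble(2) + "\t" + rs.getInt(3));
      }
    }
  }
}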


SPEED: Increasing Hive Performance

Performance improvements included in Hive 0.12:
  – Base and advanced query optimization
  – Startup time improvement
  – Join optimizations

Interactive query times across ALL use cases:
•  Simple and advanced queries in seconds
•  Integrates seamlessly with existing tools
•  Currently a >100x improvement in just nine months


Tez on YARN

[Diagram: the earlier cluster view, now with Hive/Tez (SQL) vertices (vertex1.1.1, 1.1.2, 1.2.1, 1.2.2) running alongside batch MapReduce (map 1.1, map 1.2, reduce 1.1) and real-time Storm (nimbus0, nimbus1, nimbus2), all under one ResourceManager/Scheduler.]


Apache Falcon: Data Lifecycle Management for Hadoop


Data Lifecycle on Hadoop is Challenging

Data management needs: data processing, replication, retention, scheduling, reprocessing, multi-cluster management.
Tools: Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs.

Problem: a patchwork of tools complicates data lifecycle management. Result: long development cycles and quality challenges.


Falcon: One-stop Shop for Data Lifecycle

Apache Falcon provides and orchestrates:
Data management needs: data processing, replication, retention, scheduling, reprocessing, multi-cluster management.
Tools: Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs.

Falcon provides a single interface to orchestrate the data lifecycle. Sophisticated DLM is easily added to Hadoop applications.


Falcon Core Capabilities
•  Core functionality
  – Pipeline processing
  – Replication
  – Retention
  – Late data handling
•  Automates
  – Scheduling and retry
  – Recording audit, lineage and metrics
•  Operations and management
  – Monitoring, management, metering
  – Alerts and notifications
  – Multi-cluster federation
•  CLI and REST API (a REST sketch follows below)
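
A hedged sketch of that REST surface, assuming the entity-listing endpoint from the Apache Falcon docs and a hypothetical host, port and user:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FalconList {
  public static void main(String[] args) throws Exception {
    // List registered feed entities: GET api/entities/list/:entity-type
    URL url = new URL("http://falcon:15000/api/entities/list/feed?user.name=falcon");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      for (String line; (line = in.readLine()) != null; ) System.out.println(line);
    }
  }
}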


Falcon At A Glance

>  Falcon offers a high-level abstraction of key services for Hadoop data management needs.
>  Complex data processing logic is handled by Falcon instead of being hard-coded in data processing apps.
>  Falcon enables faster development of ETL, reporting and other data processing apps on Hadoop.

[Diagram: data processing applications sit on the Falcon data management framework, which provides data import and replication, scheduling and coordination, data lifecycle policies, multi-cluster management, and SLA management.]


Falcon Example: Replication

>  Falcon manages workflow and replication.
>  Enables business continuity without requiring full data representation.
>  Failover clusters can be smaller than primary clusters.

[Diagram: data sets (staged, cleansed, conformed, processed, access) on the primary cluster, with two Falcon replication flows copying selected stages to the failover cluster.]


Falcon Example: Retention

>  Sophisticated retention policies expressed in one place.
>  Simplify data retention for audit, compliance, or data re-processing.

  – Staged data: retain 20 years
  – Cleansed data: retain 3 years
  – Conformed data: retain 3 years
  – Access data: retain last copy only


Falcon Example: Late Data Handling

>  Processing waits until all required input data is available.
>  Checks for late data arrivals and retriggers processing as necessary.
>  Eliminates writing complex data-handling rules within applications.

[Diagram: online transaction data (via Sqoop) and web log data (via FTP) land as staged data and are joined into a combined dataset; the process waits up to 4 hours for the FTP data to arrive.]


Examples

Page 43


Example: Cluster Specification

<?xml version="1.0"?>
<!-- My Local Cluster specification -->
<cluster colo="my-local-cluster" description="" name="cluster-alpha">
  <interfaces>
    <!-- readonly and write point at the NameNode -->
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0"/>
    <!-- execute points at the Resource Manager -->
    <interface type="execute" endpoint="rm:8050" version="2.2.0"/>
    <!-- workflow points at the Oozie server -->
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0"/>
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/cluster-alpha/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/cluster-alpha/working"/>
  </locations>
</cluster>

Page 44



Example: Weblogs Replication and Retention

Page 45


Example 1: Weblogs
•  Weblogs land hourly in my primary cluster
•  The HDFS location is /weblogs/{date}
•  I want to:
  – Evict weblogs from the primary cluster after 1 day

Page 46


Feed Specification 1: Weblogs

Page 47

<feed description="" name="feed-weblogs1" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <!-- the cluster where the data is located -->
    <cluster name="cluster-primary" type="source">
      <validity start="2013-10-24T00:00Z" end="2014-12-31T00:00Z"/>
      <!-- retention policy: 1 day -->
      <retention limit="days(1)" action="delete"/>
    </cluster>
  </clusters>
  <!-- the location of the data -->
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>


Example 2: Weblogs
•  Weblogs land hourly in my primary cluster
•  The HDFS location is /weblogs/{date}
•  I want to:
  – Replicate weblogs to my secondary cluster
  – Evict weblogs from the primary cluster after 2 days
  – Evict weblogs from the secondary cluster after 1 week

Page 48


Feed Specification 2: Weblogs

<feed description="" name="feed-weblogs2" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <!-- the cluster where the data is located; retention policy: 2 days -->
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <!-- the cluster where the data will be replicated; retention policy: 1 week -->
    <cluster name="cluster-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <!-- the location of the data -->
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>


Example 3: Weblogs
•  Weblogs land hourly in my primary cluster
•  The HDFS location is /weblogs/{date}
•  I want to:
  – Replicate weblogs to a discovery cluster
  – Replicate weblogs to a BCP cluster
  – Evict weblogs from the primary cluster after 2 days
  – Evict weblogs from the discovery cluster after 1 week
  – Evict weblogs from the BCP cluster after 3 months

Page 50


Feed Specification 3: Weblogs

<feed description="" name="feed-weblogs" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-discovery" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
      <!-- cluster-specific location -->
      <locations>
        <location type="data" path="/projects/recommendations/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
    <cluster name="cluster-bcp" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="months(3)" action="delete"/>
      <!-- cluster-specific location -->
      <locations>
        <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>


Apache Knox: Secure Access to Hadoop


Connecting to the Cluster: Edge Nodes
•  What is an edge node?
  – A node in a DMZ that has access to the cluster; the only way to access the cluster
  – Hadoop client APIs and MR/Pig/Hive jobs are executed from these edge nodes
  – Users SSH to the edge node, upload all job artifacts, and then execute API calls and commands from a shell

Page 53

[Diagram: a Hadoop user connects over SSH to an edge node in front of the cluster.]

•  Challenges
  – SSH, edge node, and job maintenance nightmare
  – Difficult to integrate with applications


Connecting to the Cluster: REST API

•  Useful for connecting to Hadoop from outside the cluster
•  When more client-language flexibility is required
  – i.e., when a Java binding is not an option
•  Challenges
  – The client must have knowledge of the cluster topology
  – Ports must be opened outside the cluster (in some cases, on every host)

Page 54

Service: WebHDFS. API: supports HDFS user operations including reading files, writing to files, making directories, changing permissions and renaming.
Service: WebHCat. API: job control for MapReduce, Pig and Hive jobs, and HCatalog DDL commands.
Service: Oozie. API: job submission and management, and Oozie administration.
A WebHDFS sketch follows below.
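
For instance, a WebHDFS read is one plain HTTP call; a minimal sketch (the namenode host/port and file path are hypothetical; this is the kind of request Knox fronts):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsRead {
  public static void main(String[] args) throws Exception {
    // WebHDFS operation OPEN: read a file over HTTP
    URL url = new URL("http://nn:50070/webhdfs/v1/weblogs/2013-11-20-00?op=OPEN&user.name=hdfs");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setInstanceFollowRedirects(true); // the NN redirects the read to a DN
    try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      for (String line; (line = in.readLine()) != null; ) System.out.println(line);
    }
  }
}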


Apache Knox Gateway – Perimeter Security

Page 55

Simplified Access
•  Single Hadoop access point
•  Rationalized REST API hierarchy
•  Consolidated API calls
•  Multi-cluster support
•  Client DSL

Centralized Security
•  Eliminate the SSH “edge node”
•  LDAP and ActiveDirectory auth
•  Central API management + audit


Knox Gateway Network Architecture

Page 56

[Diagram: REST clients, JDBC clients, Ambari clients and browsers cross a firewall into a DMZ hosting a stateless cluster of Knox Gateway reverse-proxy instances (GW x3), wired to identity providers (Kerberos/enterprise identity provider, enterprise/cloud SSO provider) and to Ambari/Hue servers. Behind a second firewall sit Secure Hadoop Clusters 1 and 2, each with masters (JT, NN, WebHCat, Oozie, YARN, HBase, Hive) and workers (DN, TT). Requests are streamed through the gateway to Hadoop services after authentication, and URLs are rewritten to refer to the gateway.]


Wot no 2.2.0? Where can I get the Hadoop 2.2.0 fix?

Page 57


Like the Truth, Hadoop 2.2.0 is out there…

Page 58

Component         HDP 2.0   CDH4         CDH5 Beta   Intel IDH 3.0   MapR 3   IBM BigInsights 2.1
Hadoop Common     2.2.0     2.0.0        2.2.0       2.0.4           N/A      1.1.1
Hive + HCatalog   0.12      0.10 + 0.5   0.11        0.10 + 0.5      0.11     0.9 + 0.4
Pig               0.12      0.11         0.11        0.10            0.11     0.10
Mahout            0.8       0.7          0.8         0.8             0.8      N/A
Flume             1.4.0     1.4.0        1.4.0       1.3.0           1.4.0    1.3.0
Oozie             4.0.0     3.3.2        4.0.0       3.3.0           3.3.2    3.2.0
Sqoop             1.4.4     1.4.3        1.4.4       1.4.3           1.4.4    1.4.2
HBase             0.96.0    0.94.6       0.95.2      0.94.7          0.94.9   0.94.3


Thank You THUG Life
