enterprise hadoop with hortonworks and nimble storage


Upload: hortonworks

Post on 15-Jul-2015


TRANSCRIPT

Page 1 Hortonworks Confidential 2014

Enterprise Hadoop with Hortonworks and Nimble Storage

Ajay Singh, Director of Technical Alliance, Hortonworks
Ibrahim “Ibby” Rahmani, Product and Solutions Marketing, Nimble Storage

Page 2 Hortonworks Confidential 2014

Agenda

•  Hortonworks Overview

•  Big Data Use Cases

•  Hadoop Journey and Phases of Adoption

•  Requirements of Enterprise Hadoop

•  Key Trends

Page 3 Hortonworks Confidential 2014

Hortonworks Overview

Page 4 Hortonworks Confidential 2014

Our Mission: Power your Modern Data Architecture with HDP and Enterprise Apache Hadoop

Who we are
June 2011: Original 24 architects, developers, operators of Hadoop from Yahoo!
June 2014: An enterprise software company with 420+ Employees

Key Partners

Our model: Innovate and deliver Apache Hadoop as a complete enterprise data platform completely in the open, backed by a world-class support organization

Page 5 Hortonworks Confidential 2014

Who is Using Hadoop & Hortonworks

Page 6 Hortonworks Confidential 2014

HDP IS Apache Hadoop

There is ONE Enterprise Hadoop: everything else is a vendor derivation

Hortonworks Data Platform

[Figure: component versions by HDP release (HDP 2.0, October 2013; HDP 2.1, April 2014; HDP 2.2, December 2014). Components shown: Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, Zookeeper, Ambari, Storm, Flume, Knox, Phoenix, Accumulo, Falcon, Ranger, Spark, Kafka, Tez, Slider, Solr. Component groups: Data Management, Data Access, Governance & Integration, Security, Operations.]

* version numbers are targets and subject to change at time of general availability in accordance with ASF release process

Page 7 Hortonworks Confidential 2014

YARN: Data Operating System

•  Script: Pig
•  Search: Solr
•  SQL: Hive/Tez, HCatalog
•  NoSQL: HBase, Accumulo
•  Stream: Storm
•  Batch: Map Reduce

HDFS (Hadoop Distributed File System)

Contributes more to the Apache Hadoop ecosystem in the ASF than any other vendor

Hadoop is a platform decision

•  Open Source: fastest path to innovation for a platform technology

•  Eliminate vendor lock in, no proprietary software

•  Data center leaders have committed to the open source approach

Apache Project   Committers   PMC Members
Hadoop           27           20
Accumulo         2            2
Ambari           33           27
Falcon           5            3
Flume            1            0
HBase            6            4
Hive             17           4
Knox             12           3
Oozie            3            2
Pig              5            5
Sqoop            1            1
Storm            3            2
Tez              15           15
Zookeeper        2            1
TOTAL            132          89
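As a quick arithmetic check on the table above, the per-project counts can be summed and compared with the TOTAL row. This is a minimal sketch: the dictionary simply transcribes the slide's figures, and the names `CONTRIBUTORS` and `totals` are illustrative, not part of any Hortonworks tooling.

```python
# Per-project counts transcribed from the slide: project -> (committers, PMC members).
CONTRIBUTORS = {
    "Hadoop": (27, 20),
    "Accumulo": (2, 2),
    "Ambari": (33, 27),
    "Falcon": (5, 3),
    "Flume": (1, 0),
    "HBase": (6, 4),
    "Hive": (17, 4),
    "Knox": (12, 3),
    "Oozie": (3, 2),
    "Pig": (5, 5),
    "Sqoop": (1, 1),
    "Storm": (3, 2),
    "Tez": (15, 15),
    "Zookeeper": (2, 1),
}


def totals(table):
    """Return (total committers, total PMC members) across all projects."""
    committers = sum(c for c, _ in table.values())
    pmc_members = sum(p for _, p in table.values())
    return committers, pmc_members


if __name__ == "__main__":
    print(totals(CONTRIBUTORS))
```

The sums agree with the TOTAL row on the slide (132 committers, 89 PMC members).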

[Diagram: HDP 2.1 stack: Governance & Integration, Security, Operations, Data Access, Data Management, YARN]

Community Leadership

Leading Hadoop Innovations; 100% Open Source

Page 8 Hortonworks Confidential 2014

Proven By Customer Success

Customer Momentum

•  300+ customers in seven quarters, growing at 75+/quarter
•  30+ customers migrated from other distributions
•  Two thirds of customers come from F1000
•  100% Renewal Rate

Experience at Scale: 80,000 nodes under contract

•  Largest Cluster in North America: 32,000 Nodes
•  Largest Cluster in Europe: 1,000 Nodes
•  Largest Known Cluster in APAC: 400 Nodes

Fastest growing Fortune 1000 customer base

Market Leadership

Page 9 Hortonworks Confidential 2014

Big Data Trends & The Modern Data Architecture

Page 10 Hortonworks Confidential 2014

Traditional systems under pressure

•  Silos of Data
•  Costly to Scale
•  Constrained Schemas

…and difficult to manage new data

APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications

DATA SYSTEM: RDBMS, EDW, MPP

SOURCES
•  Existing Sources (CRM, ERP,…)
•  New Data Types: Clickstream; Geolocation; Sentiment, Web Data; Sensor, Machine Data; Unstructured docs, emails; Server logs

LIMITATIONS: Silos & Expensive; Single Purpose

Page 11 Hortonworks Confidential 2014

1. Unlock New Applications from New Types of Data

Data types: Sentiment & Web; Clickstream & Behavior; Machine & Sensor; Geographic; Server Logs; Structured & Unstructured

INDUSTRY / USE CASE

Financial Services
•  New Account Risk Screens ✔ ✔
•  Trading Risk ✔
•  Insurance Underwriting ✔ ✔ ✔

Telecom
•  Call Detail Records (CDR) ✔ ✔
•  Infrastructure Investment ✔ ✔
•  Real-time Bandwidth Allocation ✔ ✔ ✔

Retail
•  360° View of the Customer ✔ ✔ ✔
•  Localized, Personalized Promotions ✔
•  Website Optimization ✔

Manufacturing
•  Supply Chain and Logistics ✔
•  Assembly Line Quality Assurance ✔
•  Crowd-sourced Quality Assurance ✔

Healthcare
•  Use Genomic Data in Medical Trials ✔ ✔ ✔
•  Monitor Patient Vitals in Real-Time ✔ ✔

Pharmaceuticals
•  Recruit and Retain Patients for Drug Trials ✔ ✔
•  Improve Prescription Adherence ✔ ✔ ✔ ✔

Oil & Gas
•  Unify Exploration & Production Data ✔ ✔ ✔ ✔
•  Monitor Rig Safety in Real-Time ✔ ✔ ✔

Government
•  ETL Offload/Federal Budgetary Pressures ✔ ✔
•  Sentiment Analysis for Government Programs ✔

Page 12 Hortonworks Confidential 2014

2. Or to realize a dramatic cost savings…

EDW Optimization

Current Reality (OPERATIONS 50%, ANALYTICS 20%, ETL PROCESS 30%)
•  EDW at capacity: some usage from low value workloads
•  Older data archived, unavailable for ongoing exploration
•  Source data often discarded

Augment w/ Hadoop (OPERATIONS 50%, ANALYTICS 50%)
•  Free up EDW resources from low value tasks
•  Keep 100% of source data and historical data for ongoing exploration
•  Mine data for value after loading it because of schema-on-read
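The schema-on-read point can be illustrated with a small sketch. It is plain Python rather than Hive or HCatalog, and the sample records, field names, and the `read_with_schema` helper are all hypothetical. The idea it shows: raw records are stored untouched, and each consumer applies its own schema only when it reads them.

```python
import json

# Raw records are kept exactly as they arrived; no schema is enforced at load time.
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/cart", "ms": 340, "referrer": "ad"}',
    '{"user": "u1", "page": "/cart"}',
]


def read_with_schema(raw_lines, schema):
    """Project each raw record onto `schema` (field -> (cast, default)) at read time."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else default
            for field, (cast, default) in schema.items()
        }


# Two consumers read the same stored data through different schemas.
page_views = list(read_with_schema(RAW_EVENTS, {"user": (str, ""), "page": (str, "")}))
latency = list(read_with_schema(RAW_EVENTS, {"page": (str, ""), "ms": (int, None)}))
```

Because nothing was discarded or reshaped at load time, a question that was not anticipated when the data arrived (here, page latency) can still be asked later against the same raw records.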

[Chart: Fully-loaded Cost Per Raw TB of Data (Min–Max Cost), on an axis from $0 to $180,000, for MPP, SAN, Engineered System, NAS, HADOOP, and Cloud Storage]

Commodity Compute & Storage: Hadoop Enables Scalable Compute & Storage at a Compelling Cost Structure

Hadoop: Parse, Cleanse, Apply Structure, Transform

Storage Costs/Compute Costs from $19/GB to $0.23/GB
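To put the two per-GB figures above in proportion, the following sketch computes the reduction factor and the percentage saved. Only the $19 and $0.23 values come from the slide; the constant and function names are illustrative.

```python
# Per-GB figures quoted on the slide.
BEFORE_COST_PER_GB = 19.00
AFTER_COST_PER_GB = 0.23


def cost_reduction(before, after):
    """Return (reduction factor, percent saved) when cost drops from `before` to `after`."""
    factor = before / after
    percent_saved = (1 - after / before) * 100
    return factor, percent_saved


factor, percent_saved = cost_reduction(BEFORE_COST_PER_GB, AFTER_COST_PER_GB)

if __name__ == "__main__":
    print(f"{factor:.1f}x cheaper, {percent_saved:.1f}% saved per GB")
```

With the slide's numbers this works out to roughly an 83-fold reduction, or about 98.8% less per GB.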

Page 13 Hortonworks Confidential 2014

3. Data Lake: An architectural shift

Unlocking the Data Lake

[Diagram: SCALE versus SCOPE axes positioning RDBMS, MPP, EDW, and the Data Lake on HDP 2.1 (Governance & Integration, Security, Operations, Data Access, Data Management, YARN)]

Data Lake Enabled by YARN
•  Single data repository, shared infrastructure
•  Multiple biz apps accessing all the data
•  Enable a shift from reactive to proactive interactions
•  Gain new insight across the entire enterprise

New Analytic Apps or IT Optimization

Page 14 Hortonworks Confidential 2014

Hadoop Journey and Phases of Adoption

Page 15 Hortonworks Confidential 2014

Business Value from Hadoop: Flight Plan for a Journey in Four Phases

Phases: 1 Potential Value; 2 Operational Value; 3 Strategic Value; 4 Data-Driven Organization

Stages: Awareness & Interest; Evaluation – Business Value; Evaluation – Technical; Point Deployment; Point Production; Enterprise Deployment; Enterprise Production; Industry Leadership

Flight plan – typical elapsed time* from start of phase 1 in months: 2-6, 9-15, 18-36

* Timeline varies by company size. Often smaller or focused online businesses achieve milestones at the shorter end of the range.

Page 16 Hortonworks Confidential 2014

What Would You Like to Accomplish? Levels of Success with Hadoop

Levels: 1 Potential Value; 2 Operational Value; 3 Strategic Value; 4 Data-Driven Organization

CXO
•  1 Potential Value: Recognition of potential; Mandate to explore
•  2 Operational Value: Recognition of value realized; Sponsorship to expand use
•  3 Strategic Value: Recognition of material value realized; Sponsorship to transform organization
•  4 Data-Driven Organization: Competitive advantage; CDO part of Exec Team

Line of Business
•  1 Potential Value: Basic understanding of the value of Hadoop to the business
•  2 Operational Value: Value realized in 1 area (Customer intimacy; Operational excellence; Risk, security, compliance; New business)
•  3 Strategic Value: Value realized and tracked in many areas (Customer intimacy; Operational excellence; Risk, security, compliance; New business)
•  4 Data-Driven Organization: Data managed like capital; Intelligence at the front line; JIT decision making; Widespread value creation

Analytics & Applications
•  1 Potential Value: Basic understanding how Hadoop fits into existing landscape
•  2 Operational Value: BI and EDW access to Hadoop; Some new analytic apps, often batch; Few use cases and processing engines; Many sources and time periods; Mostly departmental silos; 10-50 enterprise users
•  3 Strategic Value: Hadoop consumable by any department, both technically and process-wise; New apps natively on Hadoop, often transactional or real-time; Many use cases and processing engines; Multiple lenses into common data pool; Emerging data science team; 50-500 enterprise users
•  4 Data-Driven Organization: Data-driven culture; High-performing data science team; Use cases build on each other; 500-5000 enterprise users

Data Mgt. & Security
•  1 Potential Value: Basic understanding how Hadoop fits
•  2 Operational Value: Benefitting from schema on read; Professionalizing data definitions and models
•  3 Strategic Value: Collaboration and granular security controls governing use of shared data
•  4 Data-Driven Organization: Incentives and process to encourage consumption of shared data

Infrastructure
•  1 Potential Value: Basic fluency with core technical concepts of Hadoop
•  2-3 Operational and Strategic Value: 1 or more production environments; Multi-tenant shared service worldwide; Data Lake; Service Desk / CoE
•  4 Data-Driven Organization: Hadoop community participation and contribution

Page 17 Hortonworks Confidential 2014

Requirements of Enterprise Hadoop

Page 18 Hortonworks Confidential 2014

The 1st Generation of Hadoop: Batch

HADOOP 1.0: Built for Web-Scale Batch Apps

[Diagram: Single App stacks labeled BATCH, INTERACTIVE, and ONLINE over HDFS]

•  All other usage patterns must leverage that same infrastructure
•  Forces the creation of silos for managing mixed workloads

Page 19 Hortonworks Confidential 2014

[Timeline diagram with markers 2006, 2009, and October 23, 2013]

Hadoop w/ MapReduce
•  HDFS (Hadoop Distributed File System)
•  MapReduce: Largely Batch Processing
•  Silo’d clusters
•  Largely batch system
•  Difficult to integrate

MR-279: YARN

Hadoop 2 & YARN
•  Hadoop2 & YARN based Architecture
•  YARN: Data Operating System on HDFS (Hadoop Distributed File System)
•  Interactive, Real-Time, Batch

Architected & led development of YARN to enable the Modern Data Architecture

Page 20 Hortonworks Confidential 2014

A Blueprint for Enterprise Hadoop

•  GOVERNANCE & INTEGRATION: Load data and manage according to policy
•  DATA ACCESS: Access your data simultaneously in multiple ways (batch, interactive, real-time)
•  DATA MANAGEMENT: Store and process all of your Corporate Data Assets
•  SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
•  OPERATIONS: Deploy and effectively manage the platform
•  PRESENTATION & APPLICATION: Enable both existing and new applications to provide value to the organization
•  ENTERPRISE MGMT & SECURITY: Empower existing operations and security tools to manage Hadoop
•  DEPLOYMENT OPTIONS: Provide deployment choice across physical, virtual, cloud

YARN Data Operating System

Page 21 Hortonworks Confidential 2014

HDP delivers a comprehensive data management platform

Hortonworks Data Platform 2.2

GOVERNANCE
•  Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS

BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
•  Script: Pig (Tez)
•  SQL: Hive (Tez)
•  Java, Scala: Cascading (Tez)
•  NoSQL: HBase, Accumulo (Slider)
•  Stream: Storm (Slider)
•  In-Memory: Spark
•  Search: Solr
•  Others: ISV Engines

YARN: Data Operating System (Cluster Resource Management)

HDFS (Hadoop Distributed File System)

SECURITY
•  Authentication, Authorization, Accounting, Data Protection
•  Storage: HDFS
•  Resources: YARN
•  Access: Hive, …
•  Pipeline: Falcon
•  Cluster: Knox
•  Cluster: Ranger

OPERATIONS
•  Provision, Manage & Monitor: Ambari, Zookeeper
•  Scheduling: Oozie

Deployment Choice: Linux, Windows, On-Premises, Cloud

YARN is the architectural center of HDP

Enables batch, interactive and real-time workloads

Provides comprehensive enterprise capabilities

The widest range of deployment options

Delivered Completely in the OPEN

Page 22 Hortonworks Confidential 2014

Key Trends

Page 23 Hortonworks Confidential 2014

Modern Data Architecture

•  Enterprise Hadoop as single consolidated Data Lake

•  Deep Integration with existing systems

•  Accelerated Interactive & Real-Time Capabilities

•  Central services for security, governance and operation

APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications

DATA SYSTEM
•  RDBMS, EDW, MPP
•  YARN: Data Operating System (Interactive, Real-Time, Batch)
•  HDFS (Hadoop Distributed File System)

SOURCES
•  EXISTING Systems: CRM, ERP, Other
•  Clickstream
•  Web & Social
•  Geolocation
•  Sensor & Machine
•  Server Logs
•  Unstructured

Hadoop As Enterprise Data Lake

Page 24 Hortonworks Confidential 2014

Multiple Deployment Choices

[Diagram: Development & POC Cluster, Production Cluster, BI or ML Cluster, Backup & Archive Cluster; label: Learn]

Deployment Choice
•  Linux, Windows
•  On-Premises, Public/Private Cloud, Hybrid

“Tethered” Clusters
•  Compatible services
•  An explicit “connection”

Synchronized Datasets
•  Efficient sharing & access
•  Governance & lineage

On-Premise & Cloud Deployments
Physical & Virtual Clusters

Page 25 Hortonworks Confidential 2014

Cloud Backup & Storage Tiering

Dataset Backup / Archival
•  Deliver business continuity through replication across on-premises and cloud-based storage targets: Microsoft Azure and Amazon S3
•  Lineage as a GA feature with supporting documentation and examples

Storage Tiers in HDFS
•  HDFS Heterogeneous storage tiering feature
•  Allow for the definition of hot/cold storage tiers within a cluster with all data remaining in cluster for data lake
•  Higher density storage, lower CPU and memory footprint machines further drive costs down for the hardware used in the cold storage tier
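The hot/cold tiering idea above can be sketched as a simple age-based rule. This is an illustration only, not the HDFS API: the 90-day threshold, the dataset paths, and the `choose_tier` and `plan_tiers` helpers are hypothetical, and a real cluster would apply the outcome through HDFS storage policies instead.

```python
from datetime import date, timedelta

# Hypothetical threshold: data untouched for more than this many days is treated as cold.
COLD_AFTER_DAYS = 90


def choose_tier(last_access, today, cold_after_days=COLD_AFTER_DAYS):
    """Return "COLD" if the data has not been accessed for more than the threshold, else "HOT"."""
    return "COLD" if (today - last_access).days > cold_after_days else "HOT"


def plan_tiers(datasets, today):
    """Map each dataset path to a tier, given a dict of path -> last access date."""
    return {path: choose_tier(last_access, today) for path, last_access in datasets.items()}


# Example with made-up paths and dates: one recent dataset and one old dataset.
today = date(2015, 1, 1)
plan = plan_tiers(
    {
        "/data/clicks/2014-12": today - timedelta(days=12),
        "/data/clicks/2013-01": today - timedelta(days=700),
    },
    today,
)
```

Everything stays inside the one cluster, as the slide describes; only the placement label changes, so the cold data can sit on denser, cheaper machines while remaining available to the data lake.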

Backup & Archive Cluster

Production Cluster

Page 26 Hortonworks Confidential 2014

Expanded Infrastructure Choices

Servers with Internal Storage
§  High performance
§  Low upfront cost
§  Limited data movement

Servers with Shared Storage
§  Ease of administration
§  Independent scale out of compute & storage
§  Shared storage infrastructure for Big Data and Legacy applications

Key Technology Trends
§  Fast & cost effective networks
§  SSD storage
§  High memory servers
§  Scale out shared storage sub systems