hp converged systems and hortonworks - webinar slides

Upload: hortonworks

Post on 02-Dec-2014

Category:

Software


DESCRIPTION

Our experts will walk you through some key design considerations when deploying a Hadoop cluster in production. We'll also share practical best practices around HP and Hortonworks Data Platform to get you started on building your modern data architecture. Learn how to:

- Leverage best practices for deployment
- Choose a deployment model
- Design your Hadoop cluster
- Build a Modern Data Architecture and vision for the Data Lake

TRANSCRIPT

Page 1: Hp Converged Systems and Hortonworks - Webinar Slides

Page 1 © Hortonworks Inc. 2014

Delivering Apache Hadoop for the Modern Data Architecture

HP & Hortonworks. We do Hadoop Together

Page 2: Hp Converged Systems and Hortonworks - Webinar Slides

Your speakers…

Raghu Thiagarajan Director, Partner Product Management, Hortonworks

Chris Daly Chief Outbound Engineer, CSS and Big Data Systems, HP

Page 3: Hp Converged Systems and Hortonworks - Webinar Slides

Why Hadoop: Traditional Data Architecture Pressured

2.8 ZB in 2012

85% from New Data Types

15x Machine Data by 2020

40 ZB by 2020

Data source: IDC

SOURCES

OLTP, ERP, CRM

Documents, Emails

Web Logs, Click Streams

Social Networks

Machine Generated

Sensor Data

Geolocation Data

Page 4: Hp Converged Systems and Hortonworks - Webinar Slides

What: Business Applications of Hadoop

Data types (column headings): Sensor, Server Logs, Text, Social, Geographic, Machine, Clickstream, Structured, Unstructured

Financial Services
New Account Risk Screens ✔ ✔
Trading Risk ✔
Insurance Underwriting ✔ ✔ ✔

Telecom
Call Detail Records (CDR) ✔ ✔
Infrastructure Investment ✔ ✔
Real-time Bandwidth Allocation ✔ ✔ ✔

Retail
360° View of the Customer ✔ ✔
Localized, Personalized Promotions ✔
Website Optimization ✔

Page 5: Hp Converged Systems and Hortonworks - Webinar Slides

What: Business Applications of Hadoop

Data types (column headings): Sensor, Server Logs, Text, Social, Geographic, Machine, Clickstream, Structured, Unstructured

Manufacturing
Supply Chain and Logistics ✔
Preventive Maintenance ✔
Crowd-sourced Quality Assurance ✔

Healthcare
Use Genomic Data in Medical Trials ✔ ✔
Monitor Patient Vitals in Real-Time

Pharmaceuticals
Recruit & Retain Patients for Drug Trials ✔ ✔
Improve Prescription Adherence ✔ ✔ ✔

Oil & Gas
Unify Exploration & Production Data ✔ ✔ ✔
Monitor Rig Safety in Real-Time ✔ ✔

Government
ETL Offload in Response to Budgetary Pressures ✔
Sentiment Analysis for Gov’t Programs ✔

Page 6: Hp Converged Systems and Hortonworks - Webinar Slides

How: Modern Data Architecture with Hadoop

OPERATIONS TOOLS: Provision, Manage & Monitor

DEV & DATA TOOLS: Build & Test

APPLICATIONS: Statistical Analysis; BI / Reporting, Ad Hoc Analysis; Interactive Web & Mobile Apps; Enterprise Applications

DATA SYSTEMS

Repositories

ROOMS

RDBMS EDW MPP

ENTERPRISE HADOOP: Governance & Integration; Security; Operations; Data Access; Data Management

SOURCES

OLTP, ERP, CRM

Documents, Emails

Web Logs, Click Streams

Social Networks

Machine Generated

Sensor Data

Geolocation Data

Page 7: Hp Converged Systems and Hortonworks - Webinar Slides

YARN Transforms Hadoop’s Architecture

Enables deep insight across a large, broad, diverse set of data at efficient scale

Multi-Use Data Platform: store all data in one place, process in many ways

Batch, Interactive, Iterative, Streaming

YARN: Data Operating System

[Figure: grid of cluster nodes numbered 1 to n]

Store any/all raw data sources and processed data over extended periods of time.

Page 8: Hp Converged Systems and Hortonworks - Webinar Slides

Designing Hadoop Cluster

§ Cluster Storage Capacity
§ Server Specification
§ Cluster Size
§ Factoring Performance

Key Considerations

§ Any piece of hardware can and will fail
§ More nodes means less impact on failure
§ Resiliency and fault tolerance improve with scale
§ Build resiliency through scale
§ Still use modern hardware
§ Software handles hardware failures

Page 9: Hp Converged Systems and Hortonworks - Webinar Slides

Storage Capacity

§ Key Input
  § Initial Data Size
  § 3 year YOY growth
  § Compression ratio
  § Intermediate and materialized views
  § Replication Factor

§ Note
  § Hard to accurately predict the size of intermediate & materialized views at the start of a project
  § Be conservative with compression ratio. Mileage varies by data type
  § Hadoop needs temp space to store intermediate files

Hadoop Cluster

Raw Data

Work In Process Data

Master Data

Materialized Views

Page 10: Hp Converged Systems and Hortonworks - Webinar Slides

Storage Capacity

Total Storage Required = (Initial Size + YOY Growth + Intermediate Data Size) x Replication Count x 1.2 / Compression Ratio

Good Rule of Thumb

§ Replication Count = 3
§ Compression Ratio = 4-5
§ Intermediate Data Size = 50%-100% of Raw Data Size

Note

§ The 1.2 factor is included in the sizing estimator to account for the temp space requirement of Hadoop
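
The sizing formula on this slide can be worked through as a small calculation. The sketch below is illustrative only: the function name, parameter names, and sample figures (100 TB initial, 50 TB growth, 100 TB intermediate) are assumptions, not values from the webinar. It also assumes that "YOY Growth" means the total growth expected over the planning horizon, which the slide does not spell out.

```python
def total_storage_tb(initial_tb, growth_tb, intermediate_tb,
                     replication=3, compression_ratio=4.0, temp_factor=1.2):
    """Estimate total cluster storage using the slide's sizing formula.

    Defaults follow the slide's rule of thumb: replication count 3,
    compression ratio 4 (low end of the 4-5 range), and a 1.2 factor
    for Hadoop temp space. All sizes are in terabytes.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    logical_tb = initial_tb + growth_tb + intermediate_tb
    return logical_tb * replication * temp_factor / compression_ratio


if __name__ == "__main__":
    # Hypothetical inputs: intermediate data sized at 100% of the initial raw data.
    estimate = total_storage_tb(initial_tb=100, growth_tb=50, intermediate_tb=100)
    print(f"Estimated total storage: {estimate:.1f} TB")
```

With these sample inputs the estimate comes to 225 TB. Lowering the compression ratio, as the slides advise when being conservative, raises the estimate proportionally.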

Page 11: Hp Converged Systems and Hortonworks - Webinar Slides

Server Specification

§ Master Nodes – NameNode, Resource Manager, HBase Master
  § Dual Intel Xeon E5-26xx series processors
  § 128GB or 256GB RAM per chassis
  § 4+ – 1TB NL-SAS/SATA Drives RAID10+ Spares

§ Worker Nodes – DataNode, Node Manager and Region Server
  § Dual Intel Xeon E5-26xx series processors
  § 128GB RAM or 256GB RAM
  § 12 – 1-4 TB NL-SAS/SATA Drives

§ Gateway Nodes / Edge Nodes
  § Mirror of Master Nodes configuration

Page 12: Hp Converged Systems and Hortonworks - Webinar Slides

Cluster Size

Number of Data Nodes = Total Storage Required / Storage Per Server

Number of Master Nodes

§ Name Node, Zookeeper
§ Resource Manager, Zookeeper
§ Failover Name Node, HBase Master, Hive Server, Zookeeper
  § In a half-rack cluster, this would be combined with Resource Manager
§ Management Node (Ambari, Ganglia, Nagios)
  § In a half-rack cluster, this would be combined with the Name Node

Note

§ Large clusters may need more than 4 master nodes
§ Start at 2/4 and grow based on usage
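
The data-node count on this slide is total storage divided by storage per server. A minimal sketch of that division follows, rounding up because a partial server still has to be bought. The function name and the 225 TB total are hypothetical. The 22 TB per server is borrowed from the worker-node configuration later in this deck ("22 TB for Data") and is used here only as an example.

```python
import math


def data_nodes_needed(total_storage_tb, storage_per_server_tb):
    """Return the number of data nodes needed to hold total_storage_tb.

    Rounds up to a whole server. Both arguments are in terabytes.
    """
    if storage_per_server_tb <= 0:
        raise ValueError("storage_per_server_tb must be positive")
    return math.ceil(total_storage_tb / storage_per_server_tb)


if __name__ == "__main__":
    # Hypothetical requirement of 225 TB on servers offering 22 TB each.
    print(data_nodes_needed(225, 22))
```

This covers storage only. The master, management, and edge nodes listed on the slide are counted separately.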

Page 13: Hp Converged Systems and Hortonworks - Webinar Slides

Factoring Performance

§ Data Nodes
  § 1 TB drives for performance clusters
  § 4 TB drives for archive clusters

§ Meeting SLA Requirements
  § Hadoop workloads are varied
  § Difficult to assess cluster size based on SLAs without actual testing
  § Good News: Hadoop performs linearly with scale
    § Enables one to design experiments using a fraction of data

§ Best Practice Guidance
  § Create a test configuration with a rack of servers
  § Load a slice of data
  § Run tests with real-life queries to measure performance & fine tune the system
  § Scale cluster size based on observed performance
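
The best-practice steps above can be turned into a rough extrapolation. The sketch below takes the slide's claim of linear scaling at face value, so runtime is assumed proportional to data volume and inversely proportional to node count. Real workloads may deviate from this, so treat the result as a starting point for further testing. The function name and all sample figures are hypothetical.

```python
import math


def nodes_for_sla(test_nodes, test_minutes, slice_tb, full_tb, sla_minutes):
    """Extrapolate a cluster size from a test run, assuming linear scaling.

    test_nodes   -- nodes in the test configuration
    test_minutes -- observed runtime of the test workload on the data slice
    slice_tb     -- size of the data slice loaded for the test
    full_tb      -- size of the full production data set
    sla_minutes  -- target runtime for the full data set
    """
    values = (test_nodes, test_minutes, slice_tb, full_tb, sla_minutes)
    if min(values) <= 0:
        raise ValueError("all inputs must be positive")
    # Node-minutes consumed per terabyte in the test run.
    node_minutes_per_tb = test_nodes * test_minutes / slice_tb
    return math.ceil(node_minutes_per_tb * full_tb / sla_minutes)


if __name__ == "__main__":
    # Hypothetical test: 18 nodes process a 10 TB slice in 30 minutes;
    # the full 100 TB data set must finish within 60 minutes.
    print(nodes_for_sla(18, 30, 10, 100, 60))
```

For the sample inputs the extrapolation suggests 90 nodes. Re-running the test at a larger slice is a cheap way to check whether the linear assumption holds for a given workload.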

Page 14: Hp Converged Systems and Hortonworks - Webinar Slides

HDP and HP are deeply integrated in the data center

OPERATIONAL TOOLS

DEV & DATA TOOLS

INFRASTRUCTURE

APPLICATIONS: BusinessObjects BI

DATA SYSTEM: RDBMS, EDW, MPP, HANA

HDP 2.1: Governance & Integration; Security; Operations; Data Access; Data Management; YARN

SOURCES: EXISTING Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured

Deep Partnerships

Hortonworks and HP engaged in deep engineered relationships with the leaders in the data center, such as Microsoft, Teradata, Red Hat, & SAP

Broad Partnerships

Over 600 partners work with Hortonworks to certify their applications to work with Hadoop so they can extend big data to their users

Page 15: Hp Converged Systems and Hortonworks - Webinar Slides

© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Delivering Apache Hadoop for the Modern Data Architecture

HP + Hortonworks Validated Design

Christopher Daly

Page 16: Hp Converged Systems and Hortonworks - Webinar Slides

The HP Approach to Apache Hadoop

Why a Reference Architecture?

• Provides a starting point or baseline
• Maximum flexibility
• Customizable to fit YOUR needs
• Adopt the parts you want
• Replace the parts you don’t

Page 17: Hp Converged Systems and Hortonworks - Webinar Slides

Solution components

Page 18: Hp Converged Systems and Hortonworks - Webinar Slides

Pre-deployment considerations / system selection

• Operating system
• Computation
• Memory
• Storage
• Network

Page 19: Hp Converged Systems and Hortonworks - Webinar Slides

High-availability considerations

• Hadoop NameNode HA
• ResourceManager HA
• OS availability and reliability
• Network reliability
• Power supply

Page 20: Hp Converged Systems and Hortonworks - Webinar Slides

Server selection

Management nodes – The HP ProLiant DL360p Gen8

The Management node and head nodes, as tested in the Reference Architecture, contain the following base configuration:

• 2 x Eight-Core Intel E5-2650 v2 Processors
• Smart Array P420i Controller with 512MB FBWC
• 3.6 TB – 4 x 900GB SFF SAS 10K RPM disks
• 128 GB DDR3 Memory – 8 x 16GB 2Rx4 PC3-14900R-13
• 10GbE 2P NIC 561FLR-T card

Page 21: Hp Converged Systems and Hortonworks - Webinar Slides

Server selection

Worker nodes – ProLiant DL380p Gen8

The ProLiant DL380p Gen8 (2U), as configured for the Reference Architecture as a worker node, has the following configuration:

• Dual 10-Core Intel Xeon E5-2670 v2 Processors with Hyper-Threading
• Twelve 2TB 3.5” 7.2K LFF SATA MDL (22 TB for Data)
• 128 GB DDR3 Memory (8 x HP 16GB), 4 channels per socket
• 1 x 10GbE 2 Port NIC FlexibleLOM (Bonded)
• 1 x Smart Array P420i Controller with 512MB FBWC

Page 22: Hp Converged Systems and Hortonworks - Webinar Slides

Switch selection

Top of Rack (ToR) switches

The 5900AF-48XGT-4QSFP+ 10GbE is an ideal ToR switch with forty-eight 10GbE ports and four 40GbE uplinks providing resiliency, high availability and scalability support. In addition, this model comes with support for CAT6 cables (copper wires) and software-defined networking (SDN).

Aggregation switches

The FlexFabric 5930-32QSFP+ 40GbE switch is an ideal aggregation switch, as it is well suited to handle very large volumes of inter-rack traffic such as can occur during shuffle and sort operations, or large-scale block replication to recreate a failed node.
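
One way to relate the ToR port counts to inter-rack traffic is the oversubscription ratio, meaning downlink bandwidth divided by uplink bandwidth. The slides do not mention this metric, so the sketch below is an added illustration. It assumes all forty-eight 10GbE ports and all four 40GbE uplinks are in use, and the function name is hypothetical.

```python
def oversubscription_ratio(downlink_ports, downlink_gbps, uplink_ports, uplink_gbps):
    """Return downlink-to-uplink bandwidth ratio for a switch.

    A ratio of 1.0 means uplinks can carry all downlink traffic at once;
    higher values mean inter-rack traffic may contend for uplink bandwidth.
    """
    uplink_total = uplink_ports * uplink_gbps
    if uplink_total <= 0:
        raise ValueError("uplink bandwidth must be positive")
    return downlink_ports * downlink_gbps / uplink_total


if __name__ == "__main__":
    # Port counts from the ToR switch description: 48 x 10GbE down, 4 x 40GbE up.
    print(oversubscription_ratio(48, 10, 4, 40))
```

With every port populated the ratio is 3.0. A rack that uses fewer server ports, or bonds two ports per server as in the worker-node configuration, would change the figure.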

Page 23: Hp Converged Systems and Hortonworks - Webinar Slides

HP Insight CMU – pushbutton scale-out management

Provision, monitor, and control: thousands of nodes instantly

Push-button roll out: provisioning via cloning for seamless scaling

Rest easy: battle-tested at Top 500 sites for over a decade

Page 24: Hp Converged Systems and Hortonworks - Webinar Slides

HP Insight CMU – GUI Monitoring at a Cluster level

Historical analysis and job recording

• Designed for Big Data customers

•  Multi-petal aggregated, 3D RT, and time series views of cluster metrics

•  “Click & zoom” analysis at both solution and component levels

•  Proactively identify and isolate performance issues

Page 25: Hp Converged Systems and Hortonworks - Webinar Slides

Single Rack Reference Architecture

Page 26: Hp Converged Systems and Hortonworks - Webinar Slides

Multi-Rack Reference Architecture

Page 27: Hp Converged Systems and Hortonworks - Webinar Slides

Capacity and sizing

Here is a general guideline on data inventory:

• Sources of data
• Frequency of data
• Raw storage
• Processed HDFS storage
• Replication factor
• Default compression turned on
• Space for intermediate files

Page 28: Hp Converged Systems and Hortonworks - Webinar Slides

System configuration guidance

Machine Type | Workload Pattern/Cluster Type | Storage | Processor (# of Cores) | Memory (GB)
Slaves | Balanced workload | Four to six 1-2 TB disks | Dual 6/8/10 cores | 48-96
Slaves | Compute intensive workload | Four to six 1-2 TB disks | Dual 8/10/12 cores | 48-128
Slaves | IO intensive workload | Twelve 1-2 TB disks | Dual 8/10/12 cores | 48-96
Slaves | HBase clusters | Twelve 1-2 TB disks | Dual 8/10/12 cores | 48-128
Masters | All workload patterns/HBase clusters | Four to six 1-2 TB disks | Dual 6/8/10 cores | Depends on number of file system objects to be created by NameNode.

Network: Dual 10 GB links for all nodes in a 20 node rack and min 2x10 / 2 x 40 GB interconnect links per rack going to a pair of central switches

Page 29: Hp Converged Systems and Hortonworks - Webinar Slides

For More Information

Get the Reference Architecture at http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA5-4975ENW

• Hortonworks: www.hortonworks.com
• HP Solutions for Apache Hadoop: hp.com/go/Hadoop
• HP ProLiant servers: hp.com/go/proliant
• HP Insight Cluster Management Utility (CMU): hp.com/go/cmu
• HP Networking: hp.com/go/networking

Or Contact Me: [email protected]

Page 30: Hp Converged Systems and Hortonworks - Webinar Slides

Next Steps...

Download the Hortonworks Sandbox

Learn Hadoop

Build Your Analytic App

Try Hadoop 2

More about HP & Hortonworks http://hortonworks.com/partner/HP

Contact us: [email protected]

Page 31: Hp Converged Systems and Hortonworks - Webinar Slides

THANK YOU