delivering apache hadoop for the modern data architecture

Upload: hortonworks

Post on 15-Jan-2015

DESCRIPTION

Join Hortonworks and Cisco as we discuss trends and drivers for a modern data architecture. Our experts will walk you through some key design considerations when deploying a Hadoop cluster in production. We'll also share practical best practices around Cisco-based big data architectures and Hortonworks Data Platform to get you started on building your modern data architecture.

TRANSCRIPT

Page 1: Delivering Apache Hadoop for the Modern Data Architecture

Page 1 © Hortonworks Inc. 2014

Delivering Apache Hadoop for the Modern Data Architecture

Cisco & Hortonworks. We do Hadoop Together

Page 2: Delivering Apache Hadoop for the Modern Data Architecture

Our speakers…

Ajay Singh, Director Technical Channels, Hortonworks

Sean McKeown, Solutions Architect, Data Center, Cisco

Page 3: Delivering Apache Hadoop for the Modern Data Architecture

Why Hadoop: Traditional Data Architecture Pressured

2.8 ZB in 2012

85% from New Data Types

15x Machine Data by 2020

40 ZB by 2020

Data source: IDC

SOURCES

OLTP, ERP, CRM

Documents, Emails

Web Logs, Click Streams

Social Networks

Machine Generated

Sensor Data

Geolocation Data

Page 4: Delivering Apache Hadoop for the Modern Data Architecture

What: Business Applications of Hadoop

Data types (table columns): Sensor, Server Logs, Text, Social, Geographic, Machine, Clickstream, Structured, Unstructured

Financial Services

New Account Risk Screens ✔ ✔

Trading Risk ✔

Insurance Underwriting ✔ ✔ ✔

Telecom

Call Detail Records (CDR) ✔ ✔

Infrastructure Investment ✔ ✔

Real-time Bandwidth Allocation ✔ ✔ ✔

Retail

360° View of the Customer ✔ ✔

Localized, Personalized Promotions ✔

Website Optimization ✔

Page 5: Delivering Apache Hadoop for the Modern Data Architecture

What: Business Applications of Hadoop

Data types (table columns): Sensor, Server Logs, Text, Social, Geographic, Machine, Clickstream, Structured, Unstructured

Manufacturing

Supply Chain and Logistics ✔

Preventive Maintenance ✔

Crowd-sourced Quality Assurance ✔

Healthcare

Use Genomic Data in Medical Trials ✔ ✔

Monitor Patient Vitals in Real-Time

Pharmaceuticals

Recruit & Retain Patients for Drug Trials ✔ ✔

Improve Prescription Adherence ✔ ✔ ✔

Oil & Gas

Unify Exploration & Production Data ✔ ✔ ✔

Monitor Rig Safety in Real-Time ✔ ✔

Government

ETL Offload in Response to Budgetary Pressures ✔

Sentiment Analysis for Gov’t Programs ✔

Page 6: Delivering Apache Hadoop for the Modern Data Architecture

How: Modern Data Architecture with Hadoop

OPERATIONS TOOLS

Provision, Manage & Monitor

DEV & DATA TOOLS

Build & Test

DATA SYSTEMS

Repositories

ROOMS

RDBMS EDW MPP

APPLICATIONS

Statistical Analysis

BI / Reporting, Ad Hoc Analysis

Interactive Web & Mobile Apps

Enterprise Applications

ENTERPRISE HADOOP

Governance & Integration

Security

Operations

Data Access

Data Management

SOURCES

OLTP, ERP, CRM

Documents, Emails

Web Logs, Click Streams

Social Networks

Machine Generated

Sensor Data

Geolocation Data

Page 7: Delivering Apache Hadoop for the Modern Data Architecture

YARN Transforms Hadoop’s Architecture

Enables deep insight across a large, broad, diverse set of data at efficient scale

Multi-Use Data Platform: Store all data in one place, process in many ways

Batch, Interactive, Iterative, Streaming

YARN: Data Operating System

[Diagram: grid of cluster nodes numbered 1 to n]

Store any/all raw data sources and processed data over extended periods of time.

Page 8: Delivering Apache Hadoop for the Modern Data Architecture

Designing Hadoop Cluster

§ Cluster Storage Capacity

§ Server Specification

§ Cluster Size

§ Factoring Performance

Key Considerations

§ Any piece of hardware can and will fail

§  More nodes means less impact on failure

§  Resiliency and fault tolerance improve with scale

§  Build resiliency through scale

§  Still use modern hardware

§  Software handles hardware failures

Page 9: Delivering Apache Hadoop for the Modern Data Architecture

Storage Capacity

§ Key Input
  § Initial Data Size
  § 3 year YOY growth
  § Compression ratio
  § Intermediate and materialized views
  § Replication Factor

§ Note
  § Hard to accurately predict the size of intermediate & materialized views at the start of a project
  § Be conservative with compression ratio. Mileage varies by data type
  § Hadoop needs temp space to store intermediate files

Hadoop Cluster

Raw Data

Work In Process Data

Master Data

Materialized Views

Page 10: Delivering Apache Hadoop for the Modern Data Architecture

Storage Capacity

Total Storage Required = (Initial Size + YOY Growth + Intermediate Data Size) × Replication Count × 1.2 / Compression Ratio

Good Rule of Thumb

§ Replication Count = 3
§ Compression Ratio = 4-5
§ Intermediate Data Size = 50%-100% of Raw Data Size

Note

The 1.2 factor is included in the sizing estimator to account for the temp space requirement of Hadoop
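The sizing formula on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a Hortonworks tool: the function name, parameter names, and sample inputs are all hypothetical, and the defaults simply reuse the slide's rule-of-thumb values (replication count 3, compression ratio 4, temp-space factor 1.2).

```python
def total_storage_required(initial_size_tb, yoy_growth_tb, intermediate_size_tb,
                           replication_count=3, compression_ratio=4.0,
                           temp_space_factor=1.2):
    """Estimate raw cluster storage in TB using the slide's sizing formula.

    All names here are illustrative assumptions; only the formula and the
    default values come from the slide.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    data_tb = initial_size_tb + yoy_growth_tb + intermediate_size_tb
    return data_tb * replication_count * temp_space_factor / compression_ratio


if __name__ == "__main__":
    # Hypothetical inputs: 100 TB initial, 150 TB of growth over three years,
    # and intermediate data at 50% of the initial raw size.
    estimate = total_storage_required(100, 150, 50)
    print(f"Estimated storage: {estimate:.1f} TB")
```

Because the slide advises being conservative with compression, it may be worth rerunning the estimate with a lower compression ratio to see how sensitive the total is to that one input.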

Page 11: Delivering Apache Hadoop for the Modern Data Architecture

Server Specification

§ Master Nodes: NameNode, Resource Manager, HBase Master
  § Dual Intel Xeon E5-26xx series processors
  § 128GB or 256GB RAM per chassis
  § 4+ × 1TB NL-SAS/SATA Drives, RAID10 + Spares

§ Worker Nodes: DataNode, Node Manager and Region Server
  § Dual Intel Xeon E5-26xx series processors
  § 128GB RAM or 256GB RAM
  § 12 × 1-4 TB NL-SAS/SATA Drives

§ Gateway Nodes / Edge Nodes
  § Mirror of Master Nodes configuration

Page 12: Delivering Apache Hadoop for the Modern Data Architecture

Cluster Size

Number of Data Nodes = Total Storage Required / Storage Per Server

Number of Master Nodes

§ Name Node, Zookeeper
§ Resource Manager, Zookeeper
§ Failover Name Node, HBase Master, Hive Server, Zookeeper
  § In a half-rack cluster, this would be combined with Resource Manager
§ Management Node (Ambari, Ganglia, Nagios)
  § In a half-rack cluster, this would be combined with the Name Node

Note

§ Large clusters may need more than 4 master nodes
§ Start at 2/4 and grow based on usage
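As a rough sketch of the data-node count on this slide (total storage required divided by storage per server), the calculation below rounds up to whole servers. The function name, its parameters, and the 270 TB sample figure are hypothetical; the default of 12 drives per server follows the worker-node specification given earlier in the deck.

```python
import math


def data_nodes_required(total_storage_tb, drives_per_server=12, drive_size_tb=4.0):
    """Return the number of data nodes needed to hold total_storage_tb.

    Illustrative helper only: storage per server is assumed to be
    drives_per_server * drive_size_tb of raw disk.
    """
    storage_per_server_tb = drives_per_server * drive_size_tb
    if storage_per_server_tb <= 0:
        raise ValueError("storage per server must be positive")
    # Round up, since a fraction of a server cannot be deployed.
    return math.ceil(total_storage_tb / storage_per_server_tb)


if __name__ == "__main__":
    # Hypothetical requirement of 270 TB of raw storage.
    print(data_nodes_required(270, drive_size_tb=4.0))  # archive-style 4 TB drives
    print(data_nodes_required(270, drive_size_tb=1.0))  # performance-style 1 TB drives
```

The same storage requirement yields very different node counts depending on drive size, which is one reason drive choice is treated as a performance decision rather than a pure capacity decision.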

Page 13: Delivering Apache Hadoop for the Modern Data Architecture

Factoring Performance

§ Data Nodes
  § 1 TB drives for performance clusters
  § 4 TB drives for archive clusters

§ Meeting SLA Requirements
  § Hadoop workloads are varied
  § Difficult to assess cluster size based on SLAs without actual testing
  § Good News: Hadoop performs linearly with scale
    § Enables one to design experiments using a fraction of data

§ Best Practice Guidance
  § Create a test configuration with a rack of servers
  § Load a slice of data
  § Run tests with real-life queries to measure performance & fine tune the system
  § Scale cluster size based on observed performance
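The best-practice steps above (test on a rack, load a slice of data, measure, then scale) can be turned into a back-of-the-envelope extrapolation. The sketch below assumes strictly linear scaling, meaning runtime is proportional to data volume divided by node count; real workloads may deviate from this. The function name, parameters, and sample figures are hypothetical.

```python
import math


def nodes_for_sla(test_nodes, test_data_tb, test_runtime_min,
                  full_data_tb, sla_min):
    """Extrapolate a node count from a small-scale test run.

    Assumes linear scaling (runtime ~ data / nodes). This is a planning
    estimate to be validated by further testing, not a guarantee.
    """
    if min(test_nodes, test_data_tb, test_runtime_min, full_data_tb, sla_min) <= 0:
        raise ValueError("all inputs must be positive")
    # Node-minutes of work observed per TB in the test run.
    node_minutes_per_tb = test_nodes * test_runtime_min / test_data_tb
    return math.ceil(node_minutes_per_tb * full_data_tb / sla_min)


if __name__ == "__main__":
    # Hypothetical test: 16 nodes processed a 10 TB slice in 30 minutes.
    # Target: 100 TB of data within a 60-minute SLA.
    print(nodes_for_sla(16, 10, 30, 100, 60))
```

An estimate like this is only a starting point; the slide's guidance is to keep scaling the cluster based on observed performance.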

Page 14: Delivering Apache Hadoop for the Modern Data Architecture

HDP and Cisco are deeply integrated in the data center

OPERATIONAL TOOLS

DEV & DATA TOOLS

INFRASTRUCTURE

SOURCES

EXISTING Systems

Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured

DATA SYSTEM

RDBMS, EDW, MPP, HANA

APPLICATIONS

BusinessObjects BI

HDP 2.1

Governance & Integration

Security

Operations

Data Access

Data Management

YARN

Deep Partnerships: Hortonworks and Cisco engage in deep engineered relationships with the leaders in the data center, such as Microsoft, Teradata, Red Hat, & SAP

Broad Partnerships: Over 600 partners work with Hortonworks to certify their applications to work with Hadoop so they can extend big data to their users

Page 15: Delivering Apache Hadoop for the Modern Data Architecture

Cisco + Hortonworks Validated Design

Sean McKeown, Solutions Architect, Data Center, Cisco

Page 16: Delivering Apache Hadoop for the Modern Data Architecture

Cisco + Hortonworks Validated Design

Page 17: Delivering Apache Hadoop for the Modern Data Architecture

Cisco UCS Common Platform Architecture (CPA) Building Blocks for Big Data

UCS 6200 Series Fabric Interconnects

Nexus 2232 Fabric Extenders

UCS Manager

UCS C240 M3 Servers

LAN, SAN, Management

Page 18: Delivering Apache Hadoop for the Modern Data Architecture

UCS + Hortonworks Reference Configurations

Cisco Common Platform Architecture Version 2 for Big Data

unformatted storage per rack for a total of 7.68 petabytes (PB) when scaled to a 10-rack configuration.

Capacity Optimized with Flash Memory

This is the industry’s first big data solution to accelerate performance with a transparent, high-performance flash-memory cache powered by LSI Nytro MegaRAID technology. The card’s 200 GB of flash memory can be used as a transparent cache tier for hard disk drives and operating system images, freeing all 12 hard disk drives for data. It offers 768 TB of unformatted storage and 3.12 TB of flash memory per rack, for a total of 7.68 PB and 31.25 TB of flash memory per domain. It is designed for big data applications including Cloudera, Hortonworks, Intel Distribution for Apache Hadoop, MapR, MarkLogic, Oracle NoSQL Database, ParAccel, and Pivotal Greenplum Database Pivotal HD solutions.

Easy Ordering

Cisco UCS CPA v2 for Big Data is available through Cisco UCS Solution Accelerator Paks (Table 1). The program helps you quickly and easily deploy a powerful, secure big data environment in your enterprise without the expense entailed in designing and building your own custom solution. The solution scales by adding servers as needed.

For More Information

For more information about Cisco UCS big data solutions, please visit http://www.cisco.com/go/bigdata.

For more information about the Cisco UCS CPA v2 for Big Data, please visit http://blogs.cisco.com/datacenter/cpav2.

Visit the Cisco big data design zone at http://www.cisco.com/go/bigdata_design.

Table 1. Cisco CPA v2 for Big Data Includes Four Optimized Configurations

Performance Optimized (UCS-SL-CPA2-P)

• Connectivity: 2 Cisco UCS 6248UP 48-Port Fabric Interconnects; 2 Cisco Nexus® 2232PP 10GE Fabric Extenders
• Management: Cisco UCS Manager
• Servers: 8 Cisco UCS C240 M3 Rack Servers, each with 2 Intel Xeon processors E5-2680 v2, 256 GB of memory, LSI MegaRAID 9271CV-8i card, and 24 900-GB 10K SFF SAS drives (168 TB total)

Performance and Capacity Balanced (UCS-SL-CPA2-PC)

• Connectivity: 2 Cisco UCS 6296UP 96-Port Fabric Interconnects; 2 Cisco Nexus 2232PP 10GE Fabric Extenders
• Management: Cisco UCS Manager
• Servers: 16 Cisco UCS C240 M3 Rack Servers, each with 2 Intel Xeon processors E5-2660 v2, 256 GB of memory, LSI MegaRAID 9271CV-8i card, and 24 1-TB 7.2K SFF SAS drives (384 TB total)

Capacity Optimized (UCS-SL-CPA2-C)

• Connectivity: 2 Cisco UCS 6296UP 96-Port Fabric Interconnects; 2 Cisco Nexus 2232PP 10GE Fabric Extenders
• Management: Cisco UCS Manager
• Servers: 16 Cisco UCS C240 M3 Rack Servers, each with 2 Intel Xeon processors E5-2640 v2, 128 GB of memory, LSI MegaRAID 9271CV-8i card, and 12 4-TB 7.2K LFF SAS drives (768 TB total)

Capacity Optimized with Flash Memory (UCS-SL-CPA2-CF)

• Connectivity: 2 Cisco UCS 6296UP 96-Port Fabric Interconnects; 2 Cisco Nexus 2232PP 10GE Fabric Extenders
• Management: Cisco UCS Manager
• Servers: 16 Cisco UCS C240 M3 Rack Servers, each with 2 Intel Xeon processors E5-2660 v2, 128 GB of memory, Cisco UCS Nytro MegaRAID 200-GB Controller, and 12 4-TB 7.2K LFF SAS drives (768 TB total)
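The per-rack totals in Table 1 follow from simple multiplication, and the short sketch below reproduces two of them along with the 10-rack figure quoted in the text. The helper name and its parameters are hypothetical; the server, drive, and rack counts are taken from the Capacity Optimized and Performance and Capacity Balanced configurations.

```python
def raw_capacity_tb(servers, drives_per_server, drive_size_tb, racks=1):
    """Return unformatted capacity in TB for a number of identical racks.

    Illustrative arithmetic only; it ignores formatting, replication,
    and operating-system overhead.
    """
    return servers * drives_per_server * drive_size_tb * racks


if __name__ == "__main__":
    # Capacity Optimized: 16 servers, each with 12 drives of 4 TB.
    per_rack = raw_capacity_tb(16, 12, 4)
    per_domain = raw_capacity_tb(16, 12, 4, racks=10)
    print(f"Capacity Optimized: {per_rack} TB per rack, "
          f"{per_domain / 1000:.2f} PB per 10-rack domain")

    # Performance and Capacity Balanced: 16 servers, each with 24 drives of 1 TB.
    print(f"Balanced: {raw_capacity_tb(16, 24, 1)} TB per rack")
```

These are raw figures; the usable capacity for Hadoop data is lower once the replication count and temp-space factor from the storage sizing formula are applied.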

Page 19: Delivering Apache Hadoop for the Modern Data Architecture

Installing Servers Today

LAN

SAN

• RAID settings
• Disk scrub actions

• Number of vHBAs
• HBA WWN assignments
• FC Boot Parameters
• HBA firmware

• FC Fabric assignments for HBAs

• QoS settings
• Border port assignment per vNIC
• NIC Transmit/Receive Rate Limiting

• VLAN assignments for NICs
• VLAN tagging config for NICs

• Number of vNICs
• PXE settings
• NIC firmware
• Advanced feature settings

• Remote KVM IP settings
• Call Home behaviour
• Remote KVM firmware

• Server UUID
• Serial over LAN settings
• Boot order
• IPMI settings
• BIOS scrub actions
• BIOS firmware
• BIOS Settings

Page 20: Delivering Apache Hadoop for the Modern Data Architecture

UCS Service Profiles

LAN

SAN

Service Profile

Page 21: Delivering Apache Hadoop for the Modern Data Architecture

Abstracting the Logical Architecture

[Diagram with two panels, "What you see" and "What you get". Labels: Chassis, Server, Adapter, 10GE A, FEX A, Eth 1/1, Switch 6200-A, Physical Cable, Virtual Cable (VN-Tag), vNIC 1, vEth 1, vHBA 1, vFC 1, Service Profile (Server), Cables]

✔ Dynamic, Rapid Provisioning

✔ State abstraction

✔ Location Independence

✔ Blade or Rack

Page 22: Delivering Apache Hadoop for the Modern Data Architecture

Cisco UCS: Physical Architecture

6200 Fabric A

6200 Fabric B

B200 VIC

FEX B

FEX A

SAN A, SAN B, ETH 1, ETH 2

MGMT MGMT

Chassis 1

Fabric Switch

Fabric Extenders

Uplink Ports

Compute Blades Half / Full width

OOB Mgmt

Server Ports

Virtualized Adapters

Cluster

Rack Mount C240

VIC

FEX A FEX B

Page 23: Delivering Apache Hadoop for the Modern Data Architecture

Simple Scalability

Single Rack: 16 servers

Single Domain: up to 10 racks, 160 servers, 7 PB

Multiple Domains

L2/L3 Switching

Page 24: Delivering Apache Hadoop for the Modern Data Architecture

Proven performance and linear scalability

Page 25: Delivering Apache Hadoop for the Modern Data Architecture

Simplified Management Throughout Cluster Lifecycle

Provisioning

Monitoring

Maintenance

Growth

UCSM provides:

• Speed
• Ease of experimentation
• Consistency
• Simplicity
• Visibility

Page 26: Delivering Apache Hadoop for the Modern Data Architecture

Complete Network Flexibility

Example:

•  vNIC0 for management

•  vNIC1 for internal

•  vNIC2 for external

•  No OS bonding needed with Fabric Failover

Configure as many vNICs and VLANs as you need with the click of a mouse

[Diagram: Data Node 1 and Data Node 2, each with vNIC 0, vNIC 1, and vNIC 2, attached to fabric interconnects 6200 A and 6200 B, with L2/L3 switching for data ingress/egress]

Page 27: Delivering Apache Hadoop for the Modern Data Architecture

Creating QoS Policies and Enabling JumboFrames

Best Effort policy for management VLAN

Platinum policy for cluster VLAN

Page 28: Delivering Apache Hadoop for the Modern Data Architecture

Switch Buffer Usage With Network QoS Policy to prioritize HBase Read Operations

[Chart: Read Latency Comparison of Non-QoS vs. QoS Policy. Axes: Latency (us) over Time. Series: READ - Average Latency (us); QoS - READ - Average Latency (us)]

~60% Read Improvement

[Chart: HBase + Hadoop Map Reduce (Terasort). Axes: Buffer Used over Timeline. Series: Hadoop TeraSort; HBase]

Page 29: Delivering Apache Hadoop for the Modern Data Architecture

UCS Rack-Mount Servers

UCS Blade Servers

UCS Common Platform Architecture with Hortonworks

SAN/NAS Arrays

Enterprise Applications

Single Platform for Traditional and Big Data Applications

Page 30: Delivering Apache Hadoop for the Modern Data Architecture

THANK YOU [email protected] [email protected]