Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights

© Copyright 2013 EMC Corporation. All rights reserved. Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights. SK Krishnamurthy

Upload: emc-academic-alliance

Post on 26-Jan-2015


Category:

Technology



DESCRIPTION

Pivotal has set up and operationalized a 1,000-node Hadoop cluster called the Analytics Workbench. It takes special setup and skills to manage such a large deployment. This session shares how we set it up and how you will manage it. After this session you will be able to:

Objective 1: Understand what it takes to operationalize a 1,000-node Hadoop cluster.

Objective 2: Understand how to set up and manage the day-to-day challenges of a large Hadoop deployment.

Objective 3: Have a view of the tools that are necessary to solve the challenges of managing a large Hadoop cluster.

TRANSCRIPT

Page 1: Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights

SK Krishnamurthy

Page 2: Traditional Enterprise Analytics Process

Page 3: The Fundamental Paradigm Shift

Internet age and exploding data growth

Enterprises leverage new data sources to identify emerging trends and opportunities

Traditional database tools not able to cope

Page 4: Enter Hadoop

Flexible

Scalable

Inexpensive

Fault-tolerant

Rapidly Adopted

Platform for Big Data

Page 5: Evolution of Process with Hadoop

Page 6: HDFS Economics Have Changed the Game

[Chart: Big Data Platform Price/TB, 2008 to 2013, price axis from $0 to $80,000; series: Big Data DB, Hadoop]

Big Data RDBMS pricing will ultimately converge with Hadoop pricing

The price per TB of Big Data RDBMS has been consistently eroding over time.

Hadoop pricing has increased slightly over time as vendors have injected value-added services into the ecosystem.

Page 7: Where We’re Going

© Copyright 2013 Pivotal. All rights reserved.

Page 8: Big Data Platform

[Diagram: Pivotal Data Platform. Labels: Analytical Query; Operational Intelligence; In-Memory DB; Run-Time Applications; In-Memory Objects; Enterprise Data Warehouse; RDBMS; Continues to serve as system of record; HDFS; Data Staging Platform; Data Mgmt. Services; Data Visualization; Compliance and financial reporting; Traditional BI/Reporting; Stream Ingestion; Streaming Services]

Page 9: Flexible Deployment Model

Portable, Elastic, HW Abstracted, Manageable, “Consumer” grade

Deploy: Public Cloud, On Premise, Private Cloud

Page 10: Pivotal HD

The world’s most powerful Hadoop distribution

Page 11: Pivotal HD

World’s first true SQL processing for enterprise-ready Hadoop

100% Apache Hadoop-based platform

Virtualization and cloud ready with VMware and Isilon

Scale tested in the 1,000-node Pivotal Analytics Workbench

Available as a software-only or appliance-based solution

Backed by EMC’s global, 24x7 support infrastructure

Page 12: Pivotal Hadoop Distributions

GPHD: Apache Hadoop 1.x

Pivotal HD: Apache Hadoop 2.x

Both: 100% Open Source Compatible

Page 13: Pivotal HD Components

• HDFS – The Hadoop Distributed File System acts as the storage layer for Hadoop

• MapReduce – Parallel processing framework used for data computation in Hadoop

• Hive – Structured, data warehouse implementation for data in HDFS that provides a SQL-like interface to Hadoop

• Pig – High-level procedural language for data pipeline/data flow processing in Hadoop

• HBase – NoSQL, key-value data store on top of HDFS

• Mahout – Library of scalable machine-learning algorithms

• Spring Hadoop – Integrates the Spring framework into Hadoop
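To make the MapReduce entry above concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases in plain Python. It does not use Hadoop or any Pivotal HD API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as a framework would between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map and reduce calls run on many nodes and the shuffle moves data over the network; the sketch only shows how the three phases fit together.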

Page 14: Pivotal HD Value-Added Components

• Installation and Configuration Manager (ICM) – cluster installation, upgrade, and expansion tools.

• GP Command Center – visual interface for cluster health, system metrics, and job monitoring.

• Hadoop Virtualization Extension (HVE) – enhances Hadoop to support virtual node awareness and enables greater cluster elasticity.

• GP Data Loader – parallel loading infrastructure that supports “line speed” data loading into HDFS.

• Isilon Integration – extensively tested at scale with guidelines for compute-heavy, storage-heavy, and balanced configurations.

• Advanced Database Services (HAWQ) – high-performance, “True SQL” query interface running within the Hadoop cluster.

• Extensions Framework (GPXF) – support for HAWQ interfaces on external data providers (HBase, Avro, etc.).

• Advanced Analytics Functions (MADlib) – ability to access parallelized machine-learning and data-mining functions at scale.

[Column headings: GPHD Includes…; Pivotal HD Adds the Following to GPHD…]

Page 15: Pivotal Core Components & Versions

Component       GPHD 1.2 Core Distribution   Pivotal HD Enterprise
Hadoop          1.0.3                        2.0.2
HBase           0.92.1                       0.94.2
Hive            0.8.1                        0.9.1
Mahout          0.6                          0.8.0
Pig             0.9.2                        0.10.0
Zookeeper       3.3.5                        3.4.3
Flume           1.2.0                        1.2.0
Sqoop           1.4.1                        1.4.1
Spring Hadoop   (no version listed)          (no version listed)

Page 16: Pivotal HD Architecture

[Diagram: Apache layer. Labels: HDFS; HBase; Pig, Hive, Mahout; Map Reduce; Sqoop; Flume; Resource Management & Workflow; Yarn; Zookeeper]

Page 17: Pivotal HD Architecture

[Diagram: Apache layer plus Pivotal HD Enterprise layer. Apache labels: HDFS; HBase; Pig, Hive, Mahout; Map Reduce; Sqoop; Flume; Resource Management & Workflow; Yarn; Zookeeper. Pivotal HD Enterprise labels: Deploy, Configure, Monitor, Manage; Command Center; Hadoop Virtualization (HVE); Data Loader]

Page 18: Pivotal HD Architecture

[Diagram: Apache layer, Pivotal HD Enterprise layer, and HAWQ layer. Apache labels: HDFS; HBase; Pig, Hive, Mahout; Map Reduce; Sqoop; Flume; Resource Management & Workflow; Yarn; Zookeeper. Pivotal HD Enterprise labels: Deploy, Configure, Monitor, Manage; Command Center; Data Loader; Hadoop Virtualization (HVE). HAWQ (Advanced Database Services) labels: Xtension Framework; Catalog Services; Query Optimizer; Dynamic Pipelining; ANSI SQL + Analytics]

Page 19: DataLoader

[Diagram labels, in slide order: Streams (Push, Pull, Connectors, Flume); HDFS; DataLoader (Data Source Registration, Copy Strategy, Optimization, Web GUI and CLI, Data Destination Registration, Data Copy, Job Management, Data Processing, REST APIs); Files (HDFS, NFS, HTTP, FTP, Local)]

Page 20: Command Center

Simple and complete cluster management

Install and configure Hadoop components and services

Centralized interface for Pivotal HD cluster monitoring, diagnostics, and management

Live and historical Hadoop system metrics analysis

[Diagram labels: Deploy, Configure, Monitor, Manage, Analyze]

Page 21: Command Center – Monitor, Manage, and Analyze

Host-, application-, and job-level performance monitoring across the entire Pivotal HD cluster

Visualize and analyze live and historical Hadoop cluster information through the Command Center Dashboard

Quick diagnostics of functional or performance issues

Page 22: Hadoop Virtualization Extensions (HVE)

• HVE enables Hadoop to support more effective virtual deployments

• This creates the opportunity to provision and scale the compute and storage processes independently, resulting in:

– Much better resource utilization

– Improved resource allocation and consumption

– Support for multi-tenancy

Page 23: HAWQ

Page 24: HAWQ: The Crown Jewels of Greenplum

SQL compliant

World-class query optimizer

Interactive query

Horizontal scalability

Robust data management

Common Hadoop formats

Deep analytics

Page 25: HAWQ: High-Performance Query Processing

Interactive and true ANSI SQL support

Multi-petabyte horizontal scalability

Cost-based parallel query optimizer

Programmable analytics

Page 26: HAWQ: Enterprise-Class Database Services & Management

Scatter-gather data loading

Row and column storage

Workload management

Multi-level partitioning

3rd-party tool & open client interfaces

Page 27: HAWQ: Pre-integrated Deep Analytics

Performance via fully parallelized implementation

Consistent, user-friendly SQL interfaces

Ease of data preparation

Pre-integrated MADlib support

– Linear Regression

– Logistic Regression

– Multinomial Logistic Regression

– K-Means

– Association Rules

– PLDA (useful for topic modeling)

Page 28: GPDB – Components

[Diagram: GPDB. Labels: Query Engine; Catalog Service; Local File System; Resource Management; GPXF; Planner; Optimizer; Executor; Transaction Manager]

Page 29: HAWQ – Components

[Diagram: GPSQL. Labels: Query Engine; Catalog Service; HDFS; Resource Management; GPXF; Planner; Optimizer; Executor; Transaction Manager]

Page 30: How HAWQ Works

[Diagram labels: Clients (JDBC/ODBC, SQL Console); HDFS Namenode and HAWQ Master Host (Query Parser, Query Optimizer); HDFS Datanode and HAWQ Segment Host, repeated, each with a Query Executor]

Example query submitted by the client:

SELECT beer, price FROM Bars b, Sells s WHERE b.name = s.bar AND b.city = 'San Francisco'
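The example query on this slide can be tried against toy data. The sketch below runs it in an in-memory SQLite database, which stands in for HAWQ only to show what the join and filter return. The schema (`Bars(name, city)`, `Sells(bar, beer, price)`) is inferred from the columns the query references, the rows are invented, and an `ORDER BY` is added so the output order is deterministic.

```python
import sqlite3

# The slide's query, with an ORDER BY added for a stable result order.
QUERY = """
SELECT beer, price
FROM Bars b, Sells s
WHERE b.name = s.bar AND b.city = 'San Francisco'
ORDER BY beer
"""


def run_example() -> list:
    """Create toy Bars/Sells tables, run the query, and return the rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Bars (name TEXT, city TEXT);
        CREATE TABLE Sells (bar TEXT, beer TEXT, price REAL);
        INSERT INTO Bars VALUES ('Anchor Tap', 'San Francisco');
        INSERT INTO Bars VALUES ('Lakeside', 'Chicago');
        INSERT INTO Sells VALUES ('Anchor Tap', 'Porter', 6.5);
        INSERT INTO Sells VALUES ('Anchor Tap', 'Lager', 5.0);
        INSERT INTO Sells VALUES ('Lakeside', 'Stout', 7.0);
        """
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run_example())
```

Only beers sold at bars located in San Francisco come back; the Chicago bar's row is filtered out by the join condition and the city predicate.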

Page 31: How HAWQ Works

[Diagram repeated from Page 30, with added labels: Optimization; Context; Cost Model; Resources; Parse Tree; Metadata]

Page 32: How HAWQ Works

[Diagram repeated from Page 30, with added label: Execution Plan]

Page 33: How HAWQ Works

[Diagram repeated from Page 30]

Page 34: How HAWQ Works

[Diagram repeated from Page 30, with added label: Dynamic Pipelining™]

Page 35: How HAWQ Works

[Diagram repeated from Page 30]

Page 36: HAWQ Deployment

• Master Servers & Name Nodes: query planning & dispatch

• Segment Servers & Data Nodes: query processing & data storage

• External Sources: loading, streaming, etc.

[Diagram labels: ODBC/JDBC Driver; Dynamic Pipelining; HDFS]
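The division of labour on this slide (a master that plans and dispatches, segments that process the data they store) follows a scatter-gather pattern. Below is a minimal sketch of that pattern, with threads standing in for segment servers and lists standing in for locally stored data. It is not HAWQ's implementation; `segment_scan` and `master_query` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Row = Dict[str, object]


def segment_scan(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Work done on one segment: filter only the rows stored locally."""
    return [row for row in rows if predicate(row)]


def master_query(segments: List[List[Row]],
                 predicate: Callable[[Row], bool]) -> List[Row]:
    """Dispatch the same filter to every segment, then gather partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(segments))) as pool:
        # map() yields results in segment order, so the gather is deterministic.
        partials = pool.map(lambda seg: segment_scan(seg, predicate), segments)
        return [row for part in partials for row in part]
```

Each segment touches only its own rows, so adding segments spreads the scan across more workers while the master's job stays limited to dispatch and merge.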

Page 37: Xtension Framework

An advanced version of GPDB external tables

Enables combining HAWQ data and Hadoop data in a single query

Supports connectors for HDFS, HBase, and Hive

Provides an extensible framework API to enable custom connector development for other data sources

[Diagram: HDFS, HBase, and Hive connected through the Xtension Framework]
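One way to picture an extensible connector API is a plug-in registry keyed by source type, where each data source implements a common read interface. The sketch below illustrates that general idea only; it is not the GPXF API, and every name in it (`Connector`, `register`, `InMemoryConnector`, `scan`, the `memory://` scheme) is hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple, Type

Record = Tuple[str, ...]


class Connector(ABC):
    """Hypothetical interface that every data-source connector implements."""

    @abstractmethod
    def read(self, location: str) -> Iterator[Record]:
        ...


_REGISTRY: Dict[str, Type[Connector]] = {}


def register(scheme: str):
    """Class decorator that makes a connector available under a URI scheme."""
    def decorator(cls: Type[Connector]) -> Type[Connector]:
        _REGISTRY[scheme] = cls
        return cls
    return decorator


@register("memory")
class InMemoryConnector(Connector):
    """Toy connector serving rows from a dict; stands in for a real source."""

    DATA: Dict[str, List[Record]] = {
        "sales": [("Porter", "6.5"), ("Lager", "5.0")],
    }

    def read(self, location: str) -> Iterator[Record]:
        return iter(self.DATA.get(location, []))


def scan(uri: str) -> List[Record]:
    """Resolve 'scheme://location' to a registered connector and read it."""
    scheme, _, location = uri.partition("://")
    if scheme not in _REGISTRY:
        raise ValueError(f"no connector registered for scheme {scheme!r}")
    return list(_REGISTRY[scheme]().read(location))
```

Supporting a new source then means writing one class and registering it; the query side keeps calling `scan` unchanged.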

Page 38: HAWQ Benchmarks

User intelligence: 4.2 vs. 198 (47X)

Sales analysis: 8.7 vs. 161 (19X)

Click analysis: 2.0 vs. 415 (208X)

Data exploration: 2.7 vs. 1,285 (476X)

BI drill down: 2.8 vs. 1,815 (648X)
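The column labels for these figures did not survive in the transcript, but the multipliers are consistent with dividing the larger number in each row by the smaller one. The short sketch below recomputes them under that assumption; the variable and function names are illustrative only.

```python
# (workload, smaller figure, larger figure, multiplier shown on the slide)
BENCHMARKS = [
    ("User intelligence", 4.2, 198.0, 47),
    ("Sales analysis", 8.7, 161.0, 19),
    ("Click analysis", 2.0, 415.0, 208),
    ("Data exploration", 2.7, 1285.0, 476),
    ("BI drill down", 2.8, 1815.0, 648),
]


def speedup(smaller: float, larger: float) -> int:
    """Ratio of the two figures, rounded half up to a whole multiplier."""
    return int(larger / smaller + 0.5)


def check_all() -> bool:
    """True if every recomputed ratio matches the multiplier on the slide."""
    return all(speedup(lo, hi) == shown for _, lo, hi, shown in BENCHMARKS)


if __name__ == "__main__":
    for name, lo, hi, shown in BENCHMARKS:
        print(f"{name}: {hi / lo:.1f}x (slide shows {shown}X)")
```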

Page 39: Pivotal Analytics Workbench (AWB)

Commitment to Accelerating Innovation & Contributing to the Apache Community

• Multi-million dollar investment by Pivotal and partners in a 1,000-node, 24-petabyte cluster to facilitate innovation and conduct regular integration/scale testing of Apache Hadoop

• Full-time, dedicated integration onboarding projects and validating each release of Apache Hadoop at scale

• Contributing back our results and findings to the open source community as well as incorporating them into the continued development of Pivotal HD

Page 40: “Real” Hadoop Cluster

Page 41: Leveraging Full Power of the Family

Page 42: Pivotal Sessions at EMC World

The Pivotal Platform: A Purpose-Built Platform for Big-Data-Driven Applications
Presenter: Josh Klahr
Dates/Times: Tue 5:30 - 6:30, Palazzo E; Wed 11:30 - 12:30, Delfino 4005

Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action
Presenter: Noelle Sio
Dates/Times: Tue 10:00 - 11:00, Lando 4205; Thu 8:30 - 9:30, Palazzo F

Pivotal: Operationalizing 1000-node Hadoop Cluster – Analytics Workbench
Presenters: Clinton Ooi, Bhavin Modi
Dates/Times: Tue 11:30 - 12:30, Palazzo L; Thu 10:00 - 11:00 am, Delfino 4001A

Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights
Presenter: SK Krishnamurthy
Dates/Times: Mon 4:00 - 5:00, Lando 4201 A; Tue 4:00 - 5:00, Palazzo M

Pivotal: Big & Fast data – merging real-time data and deep analytics
Presenter: Michael Crutcher
Dates/Times: Mon 1:00 - 2:00, Lando 4201 A; Wed 10:00 - 11:00, Palazzo M

Pivotal: Virtualize Big Data to Make The Elephant Dance
Presenters: June Yang, Dan Baskette
Dates/Times: Mon 11:30 - 12:30, Marcello 4401A; Wed 4:00 - 5:00, Palazzo E

Hadoop Design Patterns
Presenter: Don Miner
Dates/Times: Mon 2:30 - 3:30, Palazzo F; Wed 8:30 - 9:30, Delfino 4005

Page 43: Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights