Introduction to Pivotal HAWQ


Upload: seungdon-choi

Post on 13-Aug-2015


TRANSCRIPT

Page 1: Introduction to Pivotal HAWQ
Page 2: Introduction to Pivotal HAWQ

2 © 2015 Pivotal Software, Inc. All rights reserved.

Introduction to Pivotal HAWQ. Seungdon Choi, Field Engineer, Pivotal Korea

Page 3: Introduction to Pivotal HAWQ

Agenda

•  Overview
•  Architecture
•  Machine Learning using HAWQ
•  Roadmap
•  Appendix: HAWQ vs Hive

Page 4: Introduction to Pivotal HAWQ

Pivotal HAWQ is an enterprise platform that provides the fewest barriers and the lowest-risk, most cost-effective, and fastest way to enter into big data analytics on Hadoop.

Page 5: Introduction to Pivotal HAWQ

So what exactly is HAWQ?

Combining SQL with Hadoop is key for analytics. SQL remains the #1 choice for data science.

•  Massively Parallel Processing (MPP) RDBMS on Hadoop
•  ANSI SQL on Hadoop
•  Extremely high performance for analytics (unlike Hive)
•  Stores all data directly on HDFS
•  Open source
•  Runs on ODP-core-based Hadoop distributions (PHD, HDP, IBM, ...)

Page 6: Introduction to Pivotal HAWQ

Why SQL on Hadoop?

1. Problems with MapReduce
1) Limits of MapReduce: slow performance, heavy dependence on developer skill, prone to bugs
2) Steep learning curve
3) Compatibility issues with legacy systems and applications
4) Poor ad-hoc query performance makes parallel use of a DBMS unavoidable

2. Why SQL on Hadoop
1) ANSI SQL support: easy to integrate with or replace existing systems, shorter development time, low learning curve (convenient for existing developers)
2) High throughput: overcomes the limits of MapReduce
3) Low response times
4) Compatible with legacy systems and apps (existing BI tools such as SAS and Tableau can be reused)
5) Interactive queries: higher analysis productivity → faster decision making

Page 7: Introduction to Pivotal HAWQ

Pivotal HAWQ

•  Uses the Greenplum database engine, proven in the enterprise market for over 15 years: partitioning, compression, and resource management
•  100% ANSI SQL compliant: reuse existing BI and SAS tools
•  Interactive queries: accesses distributed data directly, without MapReduce
•  Integrates HDFS, HBase, Hive, and other data sources through PXF external tables
•  Faster than stock HDFS thanks to an improved libhdfs (Java → C)
•  Supports a rich set of analytics packages such as PL/R and MADlib
•  Security, user privilege management, and encryption

Page 8: Introduction to Pivotal HAWQ

HAWQ Benefits

•  Out-of-the-box SQL for Hadoop: run analytics with SQL alone, without the MapReduce programming learning curve
•  PXF external tables providing SQL access to Hadoop: a unified interface over HDFS, HBase, Hive, and other data sources
•  Broad data access, integration, and portability
•  Performance and scalability: run big data projects the way you would build a DW
   –  Parallel everything
   –  Dynamic pipelining
   –  High-speed interconnect
   –  Optimized HDFS access with libhdfs3
   –  Co-located joins and data locality
   –  Partition elimination
   –  Higher cluster utilization
   –  Concurrency control

Page 9: Introduction to Pivotal HAWQ

Architecture

Page 10: Introduction to Pivotal HAWQ

Basic Architecture

[Diagram: the HAWQ Master (catalog, local TM, parser, query optimizer, dispatch, execution coordination) and a HAWQ Standby Master sit alongside the HDFS NameNode and Secondary NameNode; each segment host runs a query executor, PXF, one or more segments, local temp storage, and an HDFS DataNode, all connected by the interconnect.]

Page 11: Introduction to Pivotal HAWQ

HAWQ Master

•  Receives SQL requests from clients, parses them, dispatches work to each segment node, and returns the collected results to the client
•  Holds no user data; stores only system metadata in the Global System Catalog
•  A Standby Master (warm standby) takes over the master's role on hardware failure
•  In production systems it is typically installed on a separate server from the Hadoop NameNode

[Diagram: HAWQ Master with local TM, query executor, parser, query optimizer, dispatch, catalog, and local storage.]

Page 12: Introduction to Pivotal HAWQ

HAWQ Segments

•  A HAWQ segment within a segment host is an HDFS client that runs on a DataNode
•  Multiple segments can run on a single segment host/DataNode
•  Segment = the basic unit of parallelism: multiple segments work together to form a single parallel query-processing system
•  Operations (scans, joins, aggregations, sorts, etc.) execute in parallel across all segments simultaneously
•  Uses libhdfs3 (rewritten by Pivotal) for faster HDFS reads and writes

[Diagram: segment host with query executor, PXF, segments, local temp storage, and an HDFS DataNode.]

Page 13: Introduction to Pivotal HAWQ

HAWQ Interconnect: Performance and Scalability

•  Inter-process communication between segments over a standard Ethernet switching fabric
•  Uses UDP (User Datagram Protocol) for better performance and scalability
•  Performs additional packet verification and checking not provided by UDP, giving reliability equivalent to TCP

[Diagram: two segment hosts, each with a query executor, PXF, segments, local temp storage, and a DataNode, linked by the interconnect.]

Page 14: Introduction to Pivotal HAWQ

HAWQ Dynamic Pipelining

•  A differentiating competitive advantage
•  Core execution technology from GPDB
•  Parallel data flow using the high-speed UDP interconnect
•  No materialization of intermediate results, unlike MapReduce

[Diagram: segment hosts connected by the Dynamic Pipelining interconnect.]

Page 15: Introduction to Pivotal HAWQ

HAWQ Parser

•  Enforces syntax and semantics
•  Converts a SQL query into a parse-tree data structure describing the details of the query

[Diagram: clients connect via JDBC/SQL to the HAWQ Master (parser, query optimizer, dispatch); the master coordinates the segment hosts over the interconnect.]

Page 16: Introduction to Pivotal HAWQ

HAWQ Parallel Query Optimizer

Example parallel plan produced on the master:

Gather Motion
  Sort
    HashAggregate
      HashJoin
        Redistribute Motion
          HashJoin
            Seq Scan on lineitem
            Hash
              Seq Scan on orders
        Hash
          HashJoin
            Seq Scan on customer
            Hash
              Broadcast Motion
                Seq Scan on nation

Page 17: Introduction to Pivotal HAWQ

HAWQ Dispatch and Query Executor

1.  Dispatch communicates the query plan to the segments
2.  The Query Executor executes the physical steps in the plan

[Diagram: every segment runs the same plan slice, e.g. Scan Bars b; Scan Sells s with Filter b.city = 'San Francisco'; HashJoin b.name = s.bar; Project s.beer, s.price; Motion Redist(b.name); Motion Gather.]

Page 18: Introduction to Pivotal HAWQ

Pivotal Query Optimizer (PQO), for HAWQ and Greenplum Database

Turns a SQL query into an execution plan.

•  The first cost-based optimizer for big data
•  Applies all possible optimizations at the same time
•  A new, extensible code base
•  Rapid adoption of emerging technologies

Page 19: Introduction to Pivotal HAWQ

HAWQ Transactions

•  DataNodes in HDFS do not know what is visible: they have no idea what data they hold; visibility is defined by the NameNode
•  Likewise, segment nodes do not know what is visible; visibility is defined by the HAWQ Master
•  No distributed transaction management, so no UPDATE or DELETE
•  Truncate is implemented to support rollback of failed transactions
•  Transaction logs are present only on the HAWQ Master; for inserts, a single-phase commit is performed on the master
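The append-only model above means row-level changes are expressed as rewrites rather than in-place updates. A minimal sketch, assuming hypothetical table and column names:

```sql
-- HAWQ 1.x tables are append-only: INSERT is supported, UPDATE/DELETE are not.
CREATE TABLE sales (id int, amount numeric) DISTRIBUTED BY (id);
INSERT INTO sales VALUES (1, 10.0), (2, 20.0);

-- To "update" rows, rewrite the table instead of modifying it in place:
CREATE TABLE sales_fixed AS
  SELECT id,
         CASE WHEN id = 2 THEN 25.0 ELSE amount END AS amount
  FROM sales
DISTRIBUTED BY (id);

-- TRUNCATE is available to the user, and truncation is also what the engine
-- uses internally to roll back a failed insert transaction:
TRUNCATE TABLE sales;
```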

Page 20: Introduction to Pivotal HAWQ

HAWQ Fault Tolerance

•  Fault tolerance is guaranteed through HDFS replication
•  The replication factor is decided when creating the HDFS-backed filespace and tablespace; the default is 3
•  When a segment server goes down, its shard is accessible from another node; no data is stored for mirrors
•  Segments are recovered through the regular gprecoverseg utility

Page 21: Introduction to Pivotal HAWQ

HAWQ Availability

•  Replication is embedded in HDFS, so GPDB file replication is not needed
•  When a segment fails, the shard is accessible from another node: the HDFS NameNode points to the DataNode where the shard was replicated

[Diagram: master host with the HDFS NameNode; segments 1-3 on HDFS DataNodes.]

Page 22: Introduction to Pivotal HAWQ

Pivotal HAWQ: Polymorphic AO Storage

The same flexible row/column-based table and partition layout as GPDB optimizes both performance and storage.

•  Columnar storage is well suited to scanning a large percentage of the data
•  Row storage excels at small lookups
•  Most systems need to do both
•  Row and column orientation can be mixed within a table or database
•  Both types can be dramatically more efficient with compression
•  Compression is definable column by column: blockwise (gzip 1-9 and QuickLZ) and streamwise (run-length encoding, levels 1-4)
•  Flexible indexing and partitioning enable more granular control and true ILM

[Diagram: table SALES partitioned by month, Mar through Nov: row-oriented partitions for small scans, column-oriented partitions for full scans.]
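Mixed orientation and compression can be declared per partition. A hedged sketch in HAWQ/GPDB DDL; the table, partition names, and date ranges are illustrative:

```sql
-- Older data columnar and compressed for full scans;
-- recent data row-oriented for small lookups.
CREATE TABLE sales (id int, sale_date date, amount numeric)
WITH (appendonly = true)
DISTRIBUTED BY (id)
PARTITION BY RANGE (sale_date)
(
  PARTITION may START (date '2015-05-01') END (date '2015-06-01')
    WITH (appendonly = true, orientation = column,
          compresstype = quicklz, compresslevel = 1),
  PARTITION jun START (date '2015-06-01') END (date '2015-07-01')
    WITH (appendonly = true, orientation = row)
);
```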

Page 23: Introduction to Pivotal HAWQ

HAWQ Master Mirroring

•  The Standby Master is configured on hardware separate from the master node
•  Transaction logs are replicated in real time to guarantee consistency (warm standby); if the master node fails, the standby takes over its role
•  System catalogs are synchronized

[Diagram: master host and standby host, each with the HDFS NameNode, Global System Catalog, and transaction logs, linked by the synchronization process.]

Page 24: Introduction to Pivotal HAWQ

HAWQ Storage Options

Tables in HAWQ are:
•  Distributed
•  Partitioned (range/list)
•  Polymorphic storage
•  Row/columnar oriented
•  Compressed (zlib, QuickLZ, RLE, ...)

[Diagram: table A distributed across segments SEG-1 through SEG-N; within each segment, partitions may be row- or column-oriented and compressed, with sub-partitions: distribution, partitions, polymorphic storage.]

Page 25: Introduction to Pivotal HAWQ

Parallel loading/unloading at scale

•  gpload, gpfdist, external tables: flat files (CSV, delimited, ...); existing RDBMS systems; web tables (JSON, XML, HTML, ...); executing scripts, ...
•  PXF (native Hadoop files): HDFS flat files (CSV, delimited, ...); Hive; HBase (with predicate push-down); Avro, RCFile, SeqFile; an open, extendable API; available: Accumulo, JSON, ...
•  Spring XD: streaming or batch mode; Java development framework
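The gpload/gpfdist path above can be sketched as follows; the host, port, file names, and target table are hypothetical:

```sql
-- 1) Start gpfdist on an ETL host to serve a staging directory:
--      gpfdist -d /data/staging -p 8081
-- 2) Declare a readable external table over the files it serves:
CREATE EXTERNAL TABLE ext_sales (id int, total int, comments varchar)
LOCATION ('gpfdist://etl-host:8081/sales_*.csv')
FORMAT 'CSV' (DELIMITER ',');

-- 3) Load in parallel: every segment pulls rows from gpfdist at once.
INSERT INTO sales SELECT * FROM ext_sales;
```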

Page 26: Introduction to Pivotal HAWQ

Pivotal eXtension Framework (PXF)

•  Provides an external-table interface for querying the various data stores of the Hadoop ecosystem
•  Load data from Hadoop into HAWQ, or query it in place
•  Enables combining HAWQ data and Hadoop data in a single query
•  Supports connectors for HDFS, HBase, and Hive
•  Provides an extensible framework API to enable custom connectors
•  Available on GitHub: JSON, Accumulo, S3, ...
•  HAWQ MapReduce RecordReader

Industry differentiators:
•  Low latency on large data sets
•  Extensible and customizable
•  Considers the cost model of federated sources

[Diagram: the Pivotal HD extension framework connecting HAWQ to HDFS, HBase, and Hive.]

Page 27: Introduction to Pivotal HAWQ

PXF Features

•  Predicate push-down of filter conditions when working with HBase and Hive
•  Hive table partition exclusion
•  Collects statistics on HDFS data for optimized query plans
•  An extensible Java framework API makes it easy to build custom connectors for other data sources (e.g. Oracle DB) and formats
•  HDFS block locality to the HAWQ processing segment
•  Fast parallel optimizer (ORCA)
•  Example uses: (1) join a HAWQ dimension table with an HBase fact table; (2) quickly load HDFS, Hive, and HBase data into HAWQ for unified management; (3) use HAWQ as a federated query engine over data in many formats and stores, without materialization

Page 28: Introduction to Pivotal HAWQ

PXF External Table Examples

Simple HDFS text:

CREATE EXTERNAL TABLE jan_2012_sales (
  id int, total int, comments varchar
)
LOCATION ('pxf://10.76.72.26:50070/sales/2012/01/items_*.csv?profile=HdfsTextSimple')
FORMAT 'TEXT' (delimiter ',');

HBase table:

CREATE EXTERNAL TABLE hbase_sales (
  recordkey bytea, "cf1:saleid" int, "cf8:comments" varchar
)
LOCATION ('pxf://10.76.72.26:50070/sales?profile=HBase')
FORMAT 'custom' (formatter='gpxfwritable_import');

Export to HDFS using writable PXF:

CREATE WRITABLE EXTERNAL TABLE ...
LOCATION ('pxf://<host:port>/sales?profile=HdfsTextSimple&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
FORMAT 'text' (delimiter ',');

Page 29: Introduction to Pivotal HAWQ

Data Distribution

•  Data is distributed based on a specific column, a column set, or randomly
•  Tables distributed similarly are co-located
•  The distribution scheme can be modified through ALTER TABLE

Advantages:
•  Co-located joins
•  No data movement on joins or aggregates
•  Improved performance on complex queries
•  Query-engine optimization

[Diagram: table A hash-distributed on X and table B on Y across DataNodes DN1-DN3, so queries such as SELECT X FROM A, B WHERE A.X = B.Y and SELECT SUM(X) FROM A GROUP BY A.X run co-located.]
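A minimal sketch of co-location through matching distribution keys; the table and column names are hypothetical:

```sql
CREATE TABLE a (x int, payload text) DISTRIBUTED BY (x);
CREATE TABLE b (y int, payload text) DISTRIBUTED BY (y);

-- Rows with a.x = b.y hash to the same segment, so this join
-- needs no data movement across the interconnect:
SELECT a.x FROM a JOIN b ON a.x = b.y;

-- The distribution scheme can be changed later:
ALTER TABLE b SET DISTRIBUTED RANDOMLY;
```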

Page 30: Introduction to Pivotal HAWQ

HAWQ Distribution vs Hive Partitioning

•  In Hive, partitions are organized into folders
•  Folders are spread across the entire HDFS cluster
•  Similar data is not co-located; data location is lost
•  Data movement is required for large joins and aggregates
•  Hive partitions help only with sequential scans of the original table

[Diagram: Hive tables A and B stored as folders (a, b, c and aa, bb) spread across DataNodes DN1-DN3: the data is spread on HDFS, so there are no co-located joins and no co-located aggregates.]

Page 31: Introduction to Pivotal HAWQ

HAWQ Resource Management

•  Manage mixed workloads effectively by assigning query priorities
•  Queue-based control over the number of active queries, memory, CPU, and disk I/O
•  Configure different SLAs and change queue settings dynamically (weekly/daily/hourly)
•  Limit the number of concurrent queries
•  Set maximum and minimum cost limits
•  Set query priorities
•  Reject queries above the maximum cost up front
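The queue-based controls above map onto resource queues. A hedged sketch; the queue name, limits, and role are hypothetical:

```sql
CREATE RESOURCE QUEUE adhoc_queue WITH (
  ACTIVE_STATEMENTS = 5,   -- at most 5 concurrent queries in this queue
  MAX_COST = 1e9,          -- reject plans whose estimated cost exceeds this
  PRIORITY = LOW           -- lower CPU share under contention
);

-- Queries from this role are then governed by the queue's limits.
CREATE ROLE analyst LOGIN RESOURCE QUEUE adhoc_queue;
```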

Page 32: Introduction to Pivotal HAWQ

Machine Learning on HDFS Using HAWQ

Page 33: Introduction to Pivotal HAWQ

MADlib Advantages

•  Better parallelism: algorithms designed to leverage MPP and Hadoop architecture
•  Better scalability: algorithms scale as your data set scales
•  Better predictive accuracy: can use all the data, not a sample
•  Open source: available for customization and optimization by the user if desired

Page 34: Introduction to Pivotal HAWQ

Functions (as of Oct 2014)

Predictive Modeling Library
•  Linear systems: sparse and dense solvers; linear algebra
•  Matrix factorization: singular value decomposition (SVD); low rank
•  Generalized linear models: linear regression; logistic regression; multinomial logistic regression; Cox proportional hazards regression; elastic net regularization; robust variance (Huber-White), clustered variance, marginal effects
•  Other machine learning algorithms: principal component analysis (PCA); association rules (Apriori); topic modeling (parallel LDA); decision trees; random forest; support vector machines (SVM); conditional random fields (CRF); clustering (k-means); cross validation; naïve Bayes
•  Time series: ARIMA

Descriptive Statistics
•  Sketch-based estimators: CountMin (Cormode-Muthukrishnan); FM (Flajolet-Martin); MFV (most frequent values)
•  Correlation, summary

Inferential Statistics
•  Hypothesis tests

Support Modules
•  Array operations, sparse vectors, random sampling, probability functions, data preparation, PMML export, conjugate gradient

Page 35: Introduction to Pivotal HAWQ

Calling MADlib Functions: Fast Training, Scoring

SELECT madlib.linregr_train(
  'houses',                     -- table containing training data
  'houses_linregr',             -- table in which to save results
  'price',                      -- column containing the dependent variable
  'ARRAY[1, tax, bath, size]'   -- features included in the model
);

•  MADlib allows users to easily create models without moving data out of the system: model generation, model validation, and scoring (evaluation of) new data
•  All the data can be used in one model
•  Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
•  Open source lets you tweak and extend methods, or build your own
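Scoring can then reuse the saved coefficients. A sketch assuming MADlib's linregr_predict helper, the tables from the training call above, and a hypothetical id column:

```sql
-- houses_linregr holds one row with the fitted coefficient array (coef).
SELECT h.id,
       madlib.linregr_predict(m.coef,
                              ARRAY[1, h.tax, h.bath, h.size]) AS predicted_price
FROM houses h
CROSS JOIN houses_linregr m;
```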

Page 36: Introduction to Pivotal HAWQ

UDF, PL/X: a range of languages for analytics

•  User-defined functions in R, Python, Java, C, Perl, and PL/pgSQL
•  Python extensions such as NumPy, NLTK, scikit-learn, and SciPy can be used
•  The data parallelism of the MPP architecture delivers fast analytics

[Diagram: master host dispatching SQL over the interconnect to the segment hosts.]
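A minimal PL/Python sketch; the function, its scoring logic, and the scored table are hypothetical. The function body runs in parallel on every segment that holds matching rows:

```sql
CREATE OR REPLACE FUNCTION sentiment_score(comment text)
RETURNS float8 AS $$
    # Plain Python executes on the segment hosts; extensions such as NLTK
    # or scikit-learn could be imported here as well.
    positive = {'good', 'great', 'excellent'}
    words = (comment or '').lower().split()
    if not words:
        return 0.0
    return sum(1.0 for w in words if w in positive) / len(words)
$$ LANGUAGE plpythonu;

SELECT id, sentiment_score(comments) FROM jan_2012_sales;
```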

Page 37: Introduction to Pivotal HAWQ

PivotalR: Bringing MADlib and HAWQ to a familiar R interface

•  Challenge: harness the familiarity of R's interface and the performance and scalability benefits of in-database analytics
•  Simple solution: translate R code into SQL

PivotalR:

d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ tax + bath + size, data = d)

Generated SQL:

SELECT madlib.linregr_train(
  'houses', 'houses_linregr', 'price',
  'ARRAY[1, tax, bath, size]');

https://github.com/pivotalsoftware/PivotalR

Page 38: Introduction to Pivotal HAWQ

PivotalR Design Overview

1.  PivotalR translates R into SQL
2.  RPostgreSQL sends the SQL to be executed
3.  Computation results come back to R

The data lives in the database/Hadoop with MADlib; no data is held in R.

•  Call MADlib's in-database machine learning functions directly from R
•  Syntax is analogous to native R functions
•  Data doesn't need to leave the database
•  All heavy lifting, including model estimation and computation, is done in the database

Page 39: Introduction to Pivotal HAWQ

Security & Authorization

•  Role-based security
•  Availability of users and groups
•  Access granularity on connections, databases, schemas, tables, views, ...
•  Inheritance: inherit security privileges from other users or groups for easy administration
•  Assign groups and users to resource queues
•  Secure connections between HAWQ processes
•  Built-in column encryption (pgcrypto)
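Role-based access and column encryption follow standard PostgreSQL idioms. A hedged sketch; the roles, schema, table, and key are hypothetical:

```sql
CREATE ROLE bi_readers;                       -- a group role
CREATE ROLE alice LOGIN IN ROLE bi_readers;   -- inherits the group's privileges

GRANT USAGE  ON SCHEMA sales_mart TO bi_readers;
GRANT SELECT ON sales_mart.orders TO bi_readers;

-- pgcrypto: store a column encrypted, decrypt it on read.
CREATE TABLE customers (id int, ssn_enc bytea) DISTRIBUTED BY (id);
INSERT INTO customers
VALUES (1, pgp_sym_encrypt('123-45-6789', 'my-secret-key'));
SELECT id, pgp_sym_decrypt(ssn_enc, 'my-secret-key') FROM customers;
```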

Page 40: Introduction to Pivotal HAWQ

HAWQ Client Program: pgAdmin, a client tool for HAWQ and GPDB

Page 41: Introduction to Pivotal HAWQ

HAWQ Client Program: Aginity Workbench. Aginity Workbench for EMC Greenplum is a client program for DBAs and developers working with HAWQ/Greenplum databases, with support for Korean and many other languages.

Page 42: Introduction to Pivotal HAWQ

Pivotal HAWQ New/Enhanced Features & Roadmap (Apr 2015)

Page 43: Introduction to Pivotal HAWQ

PHD 3.0 & HAWQ 1.3.x (H1 2015)

•  Management / ODP core: Hadoop based on the industry-standard ODP core (HDFS, YARN, MapReduce, Ambari) [PHD 3.0 / HAWQ 1.3]
•  Management / Ambari: stronger PHD management, operations, monitoring (Ganglia), and alerting (Nagios) [PHD 3.0 / HAWQ 1.3]
•  Security: stronger administration (Ranger, Ambari), authentication (Kerberos, Knox), authorization (ACLs, AD/LDAP, Ranger), auditing (Ranger), and data protection (encryption) [PHD 3.0 / HAWQ 1.3]
•  Ecosystem: based on Hadoop 2.6; includes the Spark stack; adds Knox and Ranger [PHD 3.0 / HAWQ 1.3]

Page 44: Introduction to Pivotal HAWQ

PHD 3.x & HAWQ 2.x Roadmap (H2 2015)

•  Performance / materialized views: in-memory materialized views [HAWQ 2.x]
•  Performance / partitioning: multilevel partitioning performance improvements [HAWQ 2.x]
•  Management / resource management: hierarchical resource management [HAWQ 2.x]
•  Management / YARN: HAWQ resource management plugs into YARN, so YARN manages system resources centrally [HAWQ 2.x]
•  Features / HCatalog: integrated management of HAWQ and HCatalog [HAWQ 2.x]
•  Compatibility / Isilon: EMC Isilon support, running Hadoop clusters on scale-out NAS (Isilon) storage; effective for HDFS deployments of 100 TB or more [HAWQ 2.x]

Page 45: Introduction to Pivotal HAWQ

What's in HAWQ 1.3

•  New Ambari installation experience
•  Enhancements to the query optimizer and query execution
•  Incremental ANALYZE on tables
•  HAWQ 1.3.0.1 support for HDP 2.4.2.2
•  libhdfs3 updates and HDFS support for the truncate patch
•  HAWQ 1.3.0.2 support for SLES
•  Documentation enhancements on administration, etc.

Page 46: Introduction to Pivotal HAWQ

HAWQ Roadmap

First half of 2015 (1.3.x):
•  Ambari 2.0: advanced monitoring and alerting, StackAdvisor
•  Migration from the 1.2 line to 1.3
•  Isilon DA support

Second half of 2015 (2.x):
•  Isilon support
•  Elastic runtime (NxM): performance, higher concurrency, cloud optimized
•  Advanced resource manager: hierarchical, highly multi-tenant, YARN
•  HCatalog integration
•  AWS enablement
•  Improved support for multilevel partitioning
•  Open sourcing into the ASF

Page 47: Introduction to Pivotal HAWQ

Appendix: HAWQ vs Hive. Advantages over Apache Hive (Apr 2015)

Page 48: Introduction to Pivotal HAWQ

HAWQ Advantage 1: Performance

•  Fast analytics and interactive queries on an MPP parallel-processing engine
•  A cost-based optimizer built for big data: PQO (Pivotal Query Optimizer)
•  Direct query processing, rather than translation to MapReduce as in Hive, gives better query concurrency
•  Handles more user queries with fewer server resources

Page 49: Introduction to Pivotal HAWQ

HAWQ Advantage 2: 100% ANSI SQL support

•  100% ANSI SQL syntax support
•  Supports complex joins and analytical queries
•  Existing BI tools work without modification
•  No learning curve, so development and deployment are fast

TPC-DS queries runnable without modification: HAWQ, all 111 queries; Stinger, 20; Impala, 31; Presto, 12

Page 50: Introduction to Pivotal HAWQ

HAWQ Advantage 3: Rich analytics

•  A range of open-source analytics and statistics tools for data scientists: MADlib, PivotalR, PL/R, PL/Python, PL/Java, and more
•  Fast analytics by processing the entire data set in parallel, not a sample
•  In-database analytics, so no data movement is needed

[Diagram: master host and segment hosts connected by the interconnect.]

Page 51: Introduction to Pivotal HAWQ

HAWQ Advantage 4: Extensibility to many sources

•  PXF (Pivotal eXtension Framework)
•  Unified queries joining HAWQ with external source data (HDFS, Hive, Avro, HBase, ...)
•  An extensible API framework allows connections to even more data sources (e.g. Oracle, DB2, JSON, ...)
•  Parallel processing of external tables for fast data access

[Diagram: the Pivotal HD extension framework connecting HAWQ to HDFS, HBase, and Hive.]

Page 52: Introduction to Pivotal HAWQ

HAWQ Advantage 5: Integrated monitoring and management

•  PHD 3.0 integrates with open-source Ambari
•  Easy installation and management tooling
•  Monitoring integrated with other Hadoop components
•  Thanks to the shared ODP (Open Data Platform) core, runs unmodified on Hadoop distributions from 12+ partner vendors

Page 53: Introduction to Pivotal HAWQ

Pivotal technologies for different workloads

•  Batch SQL (Hive, HAWQ): minutes to hours; IO-heavy; less complex
•  Interactive SQL (HAWQ): seconds to minutes; joins; extensibility
•  OLAP SQL (HAWQ): seconds; very complex; BI tools
•  Streaming SQL (SparkSQL, Spring XD): in-memory; small data sets

Page 54: Introduction to Pivotal HAWQ

Summary: Apache Hive vs Pivotal HAWQ

•  Complex join support: Hive does not support it; HAWQ processes even complex join conditions quickly
•  Compatibility with existing BI tools: Hive is incompatible with many tools, increasing investment; HAWQ guarantees compatibility, so no additional investment
•  Interactive queries: Hive has performance issues and is optimized only for batch jobs; HAWQ runs fast interactive queries on large data sets
•  Ad-hoc queries: Hive has performance issues; HAWQ ships a cost-based optimizer tuned for ad-hoc queries
•  ANSI SQL: Hive's limited ANSI SQL support causes compatibility problems; HAWQ offers 100% SQL compliance
•  Concurrency: Hive struggles with concurrent queries; HAWQ sustains query concurrency under mixed workloads

Page 55: Introduction to Pivotal HAWQ

Summary: HAWQ Business Benefits

•  Rich, compatible SQL dialect: powerful and portable SQL apps; leverage large SQL-based ecosystems
•  TPC-DS compliance: applicable to more use cases, guaranteed compatibility with existing BI tools, stable operations
•  Linear scalability and flexible, efficient join support: offload the EDW at very low cost
•  Deep analytics + machine learning: predictive/advanced learning use cases at scale
•  Data integration: query diverse external data in a unified way, without moving it
•  High availability: critical workloads can be migrated from the EDW to Hadoop
•  Native Hadoop file format support: reduced ETL and data movement = lower costs

Page 56: Introduction to Pivotal HAWQ

Spark–HAWQ Integration

Page 57: Introduction to Pivotal HAWQ

Spark approaches to reading HAWQ data

•  Spark JDBC (JdbcRDD, DBInputFormat)
•  Spark with HAWQInputFormat (AO, Parquet)
•  Shared Parquet storage
•  Apache Crunch on Spark (HAWQInputFormat2)

Page 58: Introduction to Pivotal HAWQ