Apache HAWQ

© Copyright 2013 Pivotal. All rights reserved. Help Build the most Advanced SQL Database on Hadoop: Apache HAWQ Lei Chang [email protected] Apachecon 2015, Budapest, Europe


Page 1: Apache HAWQ

© Copyright 2013 Pivotal. All rights reserved.

Help Build the most Advanced SQL Database on Hadoop: Apache HAWQ

Lei Chang [email protected]

Apachecon 2015, Budapest, Europe

Page 2: Apache HAWQ


• Introduction

• Architecture

• Query processing

• Catalog service

• Virtual segments

• Interconnect

• Transaction management

• Resource management

• Storage

• How to contribute

Page 3: Apache HAWQ


HAWQ: A Hadoop Native Parallel SQL Engine

● Discover New Relationships
● Enable Data Science
● Analyze External Sources
● Query All Data Types
● Manage Multiple Workloads
● Petabyte Scale Analytics
● Sub-second Performance
● Leverage Existing Skills & Tools
● Easily Integrate with Other Tools
● Well Integrated with Hadoop Ecosystem

Feature labels from the slide diagram:

• Multi-level Fault Tolerance
• Granular Authorization
• Resource Pools
• High Multi-tenancy
• ANSI SQL Standard
• OLAP Extensions
• JDBC/ODBC Connectivity
• MPP Architecture
• Online Expansion
• HDFS
• Petabyte Scale
• Cost Based Optimizer
• Dynamic Pipelining
• ACID + Transactional
• Multi-Language UDF Support
• Built-in Data Science Library
• Extensible (PXF): Query External Sources
• Hardened, 10+ Years Tested, Production Proven
• Accessibility + Usability
• HDFS Native File Formats
• Compression + Partitioning
• core
• compliance

Page 4: Apache HAWQ


• Interactive queries

• Scalability

• Consistency

• Extensibility

• Standard compliance

• Productivity

Page 5: Apache HAWQ


History

• 2011: Prototype
  • GoH (Greenplum Database on HDFS)
• 2012: HAWQ Alpha
• March 2013: HAWQ 1.0
  • Architecture changes for a Hadoopy system
• 2013~2014: HAWQ 1.x
  • HAWQ 1.1, HAWQ 1.2, HAWQ 1.3…
• 2015: HAWQ 2.0 Beta & Apache incubating
  • http://hawq.incubator.apache.org

Page 6: Apache HAWQ


HAWQ components

Page 7: Apache HAWQ


HAWQ Architecture

[Figure: HAWQ architecture. Labels: Client, Masters, Parser/Analyzer, Optimizer, Dispatcher, Resource Manager, Fault Tolerance Service, Catalog Service, Resource Broker, libYARN, HDFS Catalog Cache, Interconnect, YARN, NameNode, External System; three nodes, each showing a Physical Segment, two Virtual Segments, a DataNode, and a NodeManager]

Page 8: Apache HAWQ


Basic query execution flow

ResourceManager

Page 9: Apache HAWQ


Plan

SELECT l_orderkey, count(l_quantity)

FROM lineitem, orders

WHERE l_orderkey=o_orderkey AND l_tax>0.01

GROUP BY l_orderkey;

Motion:
• Redistribution
• Broadcast
• Gather
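The three motion types can be illustrated with a toy in-memory model. This is only a sketch, not HAWQ code: segments are modeled as Python lists of rows, the function names `redistribute`, `broadcast`, and `gather` are hypothetical, the sample rows are made up, and the join with `orders` is left out so that the example shows only the filter, the redistribution on the grouping key, and the final gather.

```python
from collections import Counter

def redistribute(segments, key):
    """Redistribution: send each row to the segment chosen by hashing its key."""
    n = len(segments)
    out = [[] for _ in range(n)]
    for rows in segments:
        for row in rows:
            out[hash(row[key]) % n].append(row)
    return out

def broadcast(segments):
    """Broadcast: every segment receives a full copy of all rows."""
    all_rows = [row for rows in segments for row in rows]
    return [list(all_rows) for _ in segments]

def gather(segments):
    """Gather: collect the rows of all segments at a single destination."""
    return [row for rows in segments for row in rows]

# Made-up lineitem rows, spread arbitrarily over three segments.
lineitem = [
    [{"l_orderkey": 1, "l_quantity": 5, "l_tax": 0.02},
     {"l_orderkey": 2, "l_quantity": 1, "l_tax": 0.00}],
    [{"l_orderkey": 1, "l_quantity": 7, "l_tax": 0.05}],
    [{"l_orderkey": 3, "l_quantity": 2, "l_tax": 0.03},
     {"l_orderkey": 1, "l_quantity": 4, "l_tax": 0.04}],
]

# 1. Filter locally on each segment (l_tax > 0.01).
filtered = [[r for r in rows if r["l_tax"] > 0.01] for rows in lineitem]

# 2. Redistribute on the grouping key so each key lives on exactly one segment.
by_key = redistribute(filtered, "l_orderkey")

# 3. Aggregate on each segment independently.
partial = []
for rows in by_key:
    counts = Counter(r["l_orderkey"] for r in rows)
    partial.append([{"l_orderkey": k, "count": c} for k, c in counts.items()])

# 4. Gather the per-segment results at the master.
result = sorted(gather(partial), key=lambda r: r["l_orderkey"])
print(result)
```

Redistributing on `l_orderkey` before aggregating is what lets each segment count its groups without coordinating with the others; a broadcast would instead be the natural choice when one side of a join is small enough to copy everywhere.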

Page 10: Apache HAWQ


Catalog service

• Currently runs on the master; to be made external in the future

• Query language: CaQL

• CaQL supports basic single-table SELECT, COUNT(), multi-row DELETE, and single-row INSERT/UPDATE

Page 11: Apache HAWQ


Virtual segments & elasticity

• Segment is stateless

• Metadata in Catalog Service

• Self-described plan

• Compression

• Virtual Segment
  ▪ Only one physical segment on each node
    o Eases installation and management
    o Eases performance tuning
  ▪ Number of virtual segments depends on
    o Resources available at query run time
    o The cost of the query

• Node expansion & shrinking

[Figure: Master (UCS) and three Segments]
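How the number of virtual segments could follow from the two inputs named above (resources available at run time and query cost) can be sketched as follows. The slide does not describe the actual policy, so everything here is an illustrative assumption: the function name `plan_virtual_segments`, the `cost_per_vseg` parameter, the notion of free "slots" per node, and the round-robin placement.

```python
import math

def plan_virtual_segments(query_cost, cost_per_vseg, free_slots):
    """Return how many virtual segments to start on each node.

    query_cost    -- optimizer cost estimate for the query (hypothetical unit)
    cost_per_vseg -- cost one virtual segment is assumed to handle (hypothetical)
    free_slots    -- free virtual-segment slots per node at query run time
    """
    # More expensive queries ask for more virtual segments...
    wanted = max(1, math.ceil(query_cost / cost_per_vseg))
    # ...but never more than the cluster can currently provide.
    remaining = min(wanted, sum(free_slots))

    alloc = [0] * len(free_slots)
    node = 0
    while remaining > 0:
        # Round-robin over nodes, skipping nodes that are already full.
        if alloc[node] < free_slots[node]:
            alloc[node] += 1
            remaining -= 1
        node = (node + 1) % len(free_slots)
    return alloc

small_query = plan_virtual_segments(100, 50, [4, 4, 4])
large_query = plan_virtual_segments(10_000, 50, [2, 1, 3])
print(small_query, large_query)
```

Because segments are stateless, a plan like this can differ from query to query: a cheap query uses a few virtual segments, while an expensive one spreads over every free slot.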

Page 12: Apache HAWQ


Interconnect

Page 13: Apache HAWQ


Interconnect

• Two transports: TCP & UDP
• TCP interconnect limitations
  – Limited scale
  – High latency for connection setup
• Goals of UDP interconnect
  – Reliability
  – Ordering
  – Flow control
  – Performance and scalability
  – Portability

Page 14: Apache HAWQ


Interconnect state machine

Page 15: Apache HAWQ


The Design

• Multiple virtual connections share a common socket

• Multi-threaded

• Flow control: window-based

• Handles packet loss and out-of-order packets

• Deadlock elimination
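The combination of window-based flow control, retransmission of lost packets, and reordering at the receiver can be sketched with a small simulation. This is a generic sliding-window scheme with cumulative acknowledgements, not HAWQ's actual protocol: the class and function names are hypothetical, the network is simulated in memory instead of using real UDP sockets, and each loop iteration stands in for a retransmission timeout.

```python
import random

class Receiver:
    """Rebuilds an ordered stream from packets that may arrive out of order."""

    def __init__(self):
        self.expected = 0    # next sequence number to deliver
        self.buffer = {}     # out-of-order packets waiting for a gap to fill
        self.delivered = []  # payloads handed to the consumer, in order

    def on_packet(self, seq, payload):
        if seq >= self.expected:          # ignore duplicates of delivered data
            self.buffer[seq] = payload
        while self.expected in self.buffer:
            self.delivered.append(self.buffer.pop(self.expected))
            self.expected += 1
        return self.expected              # cumulative ACK: next seq wanted

def send_all(payloads, receiver, window=4, loss=0.3, seed=1):
    """Send payloads over a lossy, reordering channel; return the round count."""
    rng = random.Random(seed)
    base = 0      # oldest unacknowledged sequence number
    rounds = 0
    while base < len(payloads):
        rounds += 1
        # Flow control: at most `window` packets beyond `base` are in flight.
        in_flight = list(range(base, min(base + window, len(payloads))))
        rng.shuffle(in_flight)            # simulate out-of-order arrival
        for seq in in_flight:
            if rng.random() < loss:
                continue                  # data packet lost; resent next round
            ack = receiver.on_packet(seq, payloads[seq])
            if rng.random() >= loss:      # the ACK itself may also be lost
                base = max(base, ack)
    return rounds

payloads = [f"tuple-{i}" for i in range(20)]
rx = Receiver()
rounds = send_all(payloads, rx)
print(rx.delivered == payloads, rounds)
```

The window bounds how far the sender can run ahead of the receiver, which also bounds the receiver's reorder buffer. That bound is the flow-control property the slide refers to.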

Page 16: Apache HAWQ


Transaction Management

• Snapshot isolation

• Catalog data: MVCC

• User data: Logical EoF & swimming lane

• Truncate to help remove garbage data
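The logical-EOF idea for user data can be sketched as follows. The sketch assumes that the committed length of each append-only file is recorded as metadata, that readers only read up to the length they saw when their snapshot was taken, and that each lane (file) has a single writer at a time. The class `AppendOnlyFile` and its methods are hypothetical and model an HDFS file with an in-memory byte array.

```python
class AppendOnlyFile:
    """One 'lane': an append-only file with a committed (logical) length."""

    def __init__(self):
        self.data = bytearray()  # physical bytes, may include uncommitted data
        self.logical_eof = 0     # committed length, kept as catalog metadata

    def append(self, payload: bytes):
        # Uncommitted bytes are written past the logical EOF.
        self.data += payload

    def commit(self):
        # Committing publishes the new length; no data is rewritten.
        self.logical_eof = len(self.data)

    def abort(self):
        # Bytes past the logical EOF are garbage; truncate removes them.
        del self.data[self.logical_eof:]

    def snapshot(self):
        # A reader remembers the committed length at snapshot time.
        return self.logical_eof

    def read(self, snapshot_eof):
        # Readers never look past the EOF recorded in their snapshot.
        return bytes(self.data[:snapshot_eof])

lane = AppendOnlyFile()
lane.append(b"row1;")
lane.commit()

snap = lane.snapshot()              # reader starts here
lane.append(b"row2;")               # concurrent writer, not yet committed
in_flight_view = lane.read(snap)    # reader does not see row2

lane.abort()                        # writer fails; garbage is truncated
physical_len_after_abort = len(lane.data)

lane.append(b"row3;")
lane.commit()
old_view = lane.read(snap)              # old snapshot is unchanged
new_view = lane.read(lane.snapshot())   # new snapshot sees row3
print(in_flight_view, old_view, new_view)
```

Giving each concurrent writer its own lane avoids interleaving uncommitted bytes from different transactions in one file, so a single committed length per file is enough to decide visibility.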

Page 17: Apache HAWQ


Resource manager

● Responsibilities
  o Acquiring resources from YARN and returning them to YARN
  o Allocating resources among HAWQ users, queries, and operators

● Multi-level resource management
  o First level: integration with an external resource manager (such as YARN)
  o Second level: HAWQ internal resource manager (user/query-level resource management)
  o Third level: operator level

● Hierarchical resource queues

● CPU and memory management (allocation & enforcement)
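Hierarchical resource queues can be pictured as a tree in which each queue receives a share of its parent's memory and CPU. The following is a minimal sketch under that assumption; the `ResourceQueue` class, the `share` field, the queue names, and the cluster size are all hypothetical and do not reflect HAWQ's actual catalog or syntax.

```python
from dataclasses import dataclass, field

@dataclass
class ResourceQueue:
    name: str
    share: float                      # fraction of the parent's resources (0..1)
    children: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child

def capacities(queue, memory_mb, cores, out=None):
    """Map each queue name to its absolute (memory_mb, cores) allocation."""
    out = {} if out is None else out
    mem = memory_mb * queue.share
    cpu = cores * queue.share
    out[queue.name] = (mem, cpu)
    for child in queue.children:
        # A child's share is taken from what its parent received.
        capacities(child, mem, cpu, out)
    return out

# Hypothetical queue tree on a cluster with 64000 MB and 32 cores.
root = ResourceQueue("root", 1.0)
analytics = root.add(ResourceQueue("analytics", 0.75))
root.add(ResourceQueue("etl", 0.25))
analytics.add(ResourceQueue("adhoc", 0.5))
analytics.add(ResourceQueue("reports", 0.5))

limits = capacities(root, memory_mb=64000, cores=32)
for name, (mem, cpu) in limits.items():
    print(f"{name}: {mem:.0f} MB, {cpu:.0f} cores")
```

Computing limits top-down means that changing one parent's share automatically rescales every queue beneath it, which is the main convenience of a hierarchy over a flat list of queues.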

Page 18: Apache HAWQ


Resource manager components

[Figure: resource manager components. Labels: HAWQ Master, Query, Parser & Analyzer, Optimizer, Plan, Dispatcher, FTS (Fault Tolerance Service), Resource Manager (Leader), Request Handler, "from QD", Resource Allocator, Resource Pool, Policy Store, Resource Broker, libYARN, YARN, Mesos API, Mesos]

Page 19: Apache HAWQ


Storage

• Row-oriented: AO (append-only), compressed with QuickLZ or zlib

• PAX-like: Parquet, compressed with Snappy or gzip

• Other formats: via PXF

Page 20: Apache HAWQ


Contributing to HAWQ

• Documentation
• Wiki
• Bug reports
• Bug fixes
• Features

• Website: http://hawq.incubator.apache.org/
• Wiki: https://cwiki.apache.org/confluence/display/HAWQ
• Repo: https://github.com/apache/incubator-hawq.git
• JIRA: https://issues.apache.org/jira/browse/HAWQ
• Mailing lists: dev/[email protected]

Page 21: Apache HAWQ


Code contribution process

• Start a JIRA
• Fork the GitHub repo: https://github.com/apache/incubator-hawq.git
• Clone your fork to your local machine
• Add the GitHub repo as “upstream”
• Create a feature branch and commit your code
• Start a pull request for code review

Details: https://cwiki.apache.org/confluence/display/HAWQ/Contributing+to+HAWQ

Page 22: Apache HAWQ


Build & Setup

• Build
  – Option 1: Use the pre-built Docker image: https://hub.docker.com/r/mayjojo/hawq-dev/
  – Option 2: Build the dependencies yourself
    • https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61320026
• Setup & Run
  – HDFS (required)
  – YARN (optional)
  – HAWQ init/start/stop cluster
  – psql -d postgres

Page 23: Apache HAWQ


References

• HAWQ website:
  – http://hawq.incubator.apache.org
  – http://pivotal.io/big-data/pivotal-hawq
• HAWQ research papers
  – Lei Chang et al: HAWQ: a massively parallel processing SQL engine in hadoop. SIGMOD Conference 2014: 1223-1234
  – Mohamed A. Soliman et al: Orca: a modular query optimizer architecture for big data. SIGMOD Conference 2014: 337-348
  – Lyublena Antova et al: Optimizing queries over partitioned tables in MPP systems. SIGMOD Conference 2014: 373-384
  – Amr El-Helw et al: Optimization of Common Table Expressions in MPP Database Systems. PVLDB 8(12): 1704-1715 (2015)

Page 24: Apache HAWQ


Summary

• HAWQ: A Hadoop Native Parallel SQL Engine

• How to contribute

Page 25: Apache HAWQ
