HADOOP NATIVE SQL


Post on 20-May-2020


Page 1: What is HAWQ?...Apache HAWQ (incubating) Is an elastic parallel processing SQL engine that runs native in Apache™ Hadoop® to directly ... HAWQ Architecture. 1986 … 1994 1996 1998

HADOOP NATIVE SQL


What is HAWQ?


Apache HAWQ (incubating) is an elastic parallel processing SQL engine that runs natively in Apache™ Hadoop® to directly access data for advanced analytics.


Why HAWQ?


Hadoop Native SQL is a business imperative


1. Hadoop: the new Data Warehouse

Data is moving out of traditional data warehouses and into Apache Hadoop.

● IT’S ABOUT COST

● IT’S ABOUT COLLABORATION

● IT’S ABOUT OPEN SOURCE

● IT’S ABOUT SCALE

● IT’S ABOUT ANALYTICS

● IT’S ABOUT CLOUD

IT’S ABOUT SQL!

SQL continues to be the most valuable workload on Hadoop today


“MASHING BIG DATA WITH BIG MACHINES IS ‘BEAUTIFUL, DESIRABLE, INVESTABLE’ - IT COULD TRANSFORM GE'S BUSINESS - AND THE ECONOMY.”

Jeff Immelt, CEO, GE


Sophisticated Analytics drive competitive advantage


2. The rise of the Data Scientist

Data science enables leveraging data assets for competitive advantage.

● Data science: 800% growth in two years [1]

● Needs tools capable of rich analytics on massive data

● SQL and Machine Learning are two powerful enabling tools

● Deep ANSI SQL compliance is a requirement for many existing tools

IT’S ABOUT PREDICTIVE INSIGHTS!

[1] Source: indeed.com, http://www.indeed.com/jobtrends?q=Data-science&relative=1


Hadoop Native SQL must embrace the Hadoop ecosystem


3. Hadoop SQL ecosystem

                            | Apache HAWQ (incubating)       | Apache Hive  | Apache Drill     | Cloudera Impala
----------------------------+--------------------------------+--------------+------------------+---------------------
100% Apache Governance      | Yes                            | Yes          | Yes              | No
Native HCatalog Integration | Yes                            | Yes          | No               | Yes
Native Yarn Integration     | Yes                            | Yes          | Yes              | Yes
Native Ambari Integration   | Yes                            | Yes          | No               | No
Support ACID consistency    | Yes                            | Yes          | No               | No
Native Machine Learning     | Yes                            | No           | No               | No
Row Level Security          | Yes                            | No           | No               | Yes*
Focus                       | Low Latency & Analytic Queries | Simple Batch | Schema detection | Low latency Queries


SQL Patterns


Scalable Performance drives rapid iteration


4. TPC-DS Performance - Impala

[Chart legend: HAWQ Faster, Impala Faster]

• HAWQ Faster on 45 / 60 TPC-DS queries completed*
• 4.55x mean avg.
• 12 hrs faster total

* Impala supported 74 / 99 queries and 12 crashed mid-run


4. TPC-DS Performance - Hive w/ Tez

• HAWQ Faster on 46 / 62 TPC-DS queries completed*
• 3.44x mean avg.
• 9 hrs faster total

* Hive supported 60 / 99 queries and 5 crashed mid-run

[Chart legend: HAWQ Faster, Impala Faster]


5. TPC-DS - Standards Support

* Impala required rewriting date ranges to support partition elimination

TPC-DS Query 46

    SELECT ...
    FROM ...
    WHERE … ss_date between '1999-01-01' and '2001-12-31'
    ...

Modified to run in Impala

    SELECT ...
    FROM ...
    WHERE ...
      -- partition key filter
      ss_sold_date_sk in (
        2451181, 2451182, 2451188, 2451189, 2451195, 2451196, 2451202, 2451203,
        2451209, 2451210, 2451216, 2451217, 2451223, 2451224, 2451230, 2451231,
        2451237, 2451238, 2451244, 2451245, 2451251, 2451252, 2451258, 2451259,
        2451265, 2451266, 2451272, 2451273, 2451279, 2451280, 2451286, 2451287,
        2451293, 2451294, 2451300, 2451301, 2451307, 2451308, 2451314, 2451315,
        2451321, 2451322, 2451328, 2451329, 2451335, 2451336, 2451342, 2451343,
        2451349, 2451350, 2451356, 2451357, 2451363, 2451364, 2451370, 2451371,
        2451377, 2451378, 2451384, 2451385, 2451391, 2451392, 2451398, 2451399,
        2451405, 2451406, 2451412, 2451413, 2451419, 2451420, 2451426, 2451427,
        2451433, 2451434, 2451440, 2451441, 2451447, 2451448, 2451454, 2451455,
        2451461, 2451462, 2451468, 2451469, 2451475, 2451476, 2451482, 2451483,
        2451489, 2451490, 2451496, 2451497, 2451503, 2451504, 2451510, 2451511,
        2451517, 2451518, 2451524, 2451525, 2451531, 2451532, 2451538, 2451539,
        2451545, 2451546, 2451552, 2451553, 2451559, 2451560, 2451566, 2451567,
        2451573, 2451574, 2451580, 2451581, 2451587, 2451588, 2451594, 2451595,
        2451601, 2451602, 2451608, 2451609, 2451615, 2451616, 2451622, 2451623,
        2451629, 2451630, 2451636, 2451637, 2451643, 2451644, 2451650, 2451651,
        2451657, 2451658, 2451664, 2451665, 2451671, 2451672, 2451678, 2451679,
        2451685, 2451686, 2451692, 2451693, 2451699, 2451700, 2451706, 2451707,
        2451713, 2451714, 2451720, 2451721, 2451727, 2451728, 2451734, 2451735,
        2451741, 2451742, 2451748, 2451749, 2451755, 2451756, 2451762, 2451763,
        2451769, 2451770, 2451776, 2451777, 2451783, 2451784, 2451790, 2451791,
        2451797, 2451798, 2451804, 2451805, 2451811, 2451812, 2451818, 2451819,
        2451825, 2451826, 2451832, 2451833, 2451839, 2451840, 2451846, 2451847, ...
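To give a feel for what this manual rewrite involves, here is a minimal Python sketch that generates such a key list from a date range. It rests on two assumptions the slide does not state: that `ss_sold_date_sk` values are Julian day numbers, and that the query keeps only Saturdays and Sundays (suggested by the pairs of consecutive keys spaced seven days apart). The function name `weekend_date_keys` is hypothetical.

```python
from datetime import date, timedelta
from typing import List

# Assumption: the date surrogate key equals the Julian day number of the date.
# Python's proleptic Gregorian ordinal differs from it by this fixed offset.
JULIAN_OFFSET = 1721425


def weekend_date_keys(start: date, end: date) -> List[int]:
    """Return the assumed surrogate key of every Saturday and Sunday in [start, end]."""
    keys = []
    day = start
    while day <= end:
        if day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            keys.append(day.toordinal() + JULIAN_OFFSET)
        day += timedelta(days=1)
    return keys


keys = weekend_date_keys(date(1999, 1, 1), date(2001, 12, 31))
# Build the explicit partition-key predicate that replaces the BETWEEN filter.
predicate = "ss_sold_date_sk in (" + ", ".join(str(k) for k in keys) + ")"
```

Under those assumptions the generated keys begin with the same values as the list on the slide, which is the kind of expansion an engine with partition elimination on date ranges would make unnecessary.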


HAWQ Architecture


Historical Timeline (axis: 1986 to 2015)

Michael Stonebraker develops Postgres at UCB

Postgres adds support for SQL

Open Source PostgreSQL

PostgreSQL 7.0 released

PostgreSQL 8.0 released

Greenplum forks PostgreSQL

Hadoop 1.0 Released

HAWQ goes Apache

HAWQ project launched

Hadoop 2.0 Released


Similarities with PostgreSQL

PostgreSQL backend/     HAWQ backend/
---------------------   ---------------------
access/                 access/
bootstrap/              bootstrap/
catalog/                catalog/
                        cdb/
commands/               commands/
executor/               executor/
foreign/                foreign/
                        gp_libpq_fe/
                        gpopt/
lib/                    lib/
                        libgppc/
libpq/                  libpq/
main/                   main/
nodes/                  nodes/
optimizer/              optimizer/
parser/                 parser/
po/                     ...
port/
postmaster/
regex/
...


High Level Architecture

[Diagram labels: pxf (repeated), HDFS, hbase, Ambari, Yarn]


High Level Architecture

Parser

Session Manager

Query Rewrite

Planner ORCA

Dispatch

Interconnect Executor PXF

Storage Manager

libhdfs3

Resource Manager libyarn

Catalog

Resource Enforcer


HAWQ Ambari Integration


Storage Manager Design

PostgreSQL

➜ Single node

➜ Local storage

➜ Local Catalog

HAWQ

➜ Distributed design

➜ HDFS block storage

➜ libhdfs3

➜ Append Only

➜ Master Catalog

➜ Metadata dispatch


Data Access

➜ HAWQ supports querying unmanaged data via
    ○ native HCatalog integration
    ○ PXF
    ○ external tables

➜ HAWQ supports managed transactional tables
    ○ Managed tables are able to provide transaction isolation.
    ○ Provide atomicity of data inserts
    ○ Provide consistent views of the data


HCatalog Access

    SELECT *
    FROM hcatalog.ops.weblogs
    WHERE ts between '2015-09-01' and '2015-09-30';


[Diagram labels: HIVE, HCAT, PXF (repeated)]

weblogs: id double, date timestamp, ...

disk heap: pg_class ...

in-memory: pg_exttable, pg_class ...


PXF Design

➜ Master / agent process model

➜ Exposed as external tables in HAWQ

➜ Extensible design
    ○ Fragmenter
    ○ Accessor
    ○ Resolver

[Diagram: pxf master; pxf agents; data sources HDFS, HIVE, HBASE, ...]
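The three extension points can be pictured as a small pipeline: a fragmenter splits a source into independently readable pieces, an accessor reads raw records from one piece, and a resolver turns each record into fields. The Python sketch below only illustrates that division of labor. All class and method names are hypothetical stand-ins, not the PXF API.

```python
from dataclasses import dataclass
from typing import Iterator, List

LINES_PER_FRAGMENT = 2  # hypothetical fragment size, chosen for the demo


@dataclass
class Fragment:
    """A unit of external data that one agent could read independently."""
    source: str
    index: int


class LineFragmenter:
    """Fragmenter role: split a data source into fragments."""

    def fragments(self, source: str, data: List[str]) -> List[Fragment]:
        count = (len(data) + LINES_PER_FRAGMENT - 1) // LINES_PER_FRAGMENT
        return [Fragment(source, i) for i in range(count)]


class LineAccessor:
    """Accessor role: read the raw records that belong to one fragment."""

    def read(self, fragment: Fragment, data: List[str]) -> Iterator[str]:
        start = fragment.index * LINES_PER_FRAGMENT
        yield from data[start:start + LINES_PER_FRAGMENT]


class CsvResolver:
    """Resolver role: convert one raw record into a list of fields."""

    def resolve(self, record: str) -> List[str]:
        return record.split(",")


def scan(source: str, data: List[str]) -> List[List[str]]:
    """Run the three roles in sequence, as a single process would."""
    fragmenter, accessor, resolver = LineFragmenter(), LineAccessor(), CsvResolver()
    rows = []
    for fragment in fragmenter.fragments(source, data):
        for record in accessor.read(fragment, data):
            rows.append(resolver.resolve(record))
    return rows


sample = ["1,a", "2,b", "3,c"]
fragments = LineFragmenter().fragments("weblogs", sample)
rows = scan("weblogs", sample)
```

Supporting a new data source in this picture means supplying a different implementation of one or more of the three roles while the scan loop stays unchanged.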


Concurrent Transactional Inserts

Files in hdfs

/hawq_data/.../
    0
    1
    2
    ...

Catalog metadata

 segno | eof | …
-------+-----+...
     0 | 100 |
     1 |  20 |
     2 |  40 |
   ...


Concurrent Transactional Inserts

Files in hdfs

/hawq_data/.../
    0   <- session 1 inserts
    1
    2
    ...

Catalog metadata

 segno | eof | …
-------+-----+...
     0 | 120 | (mvcc)
     1 |  20 |
     2 |  40 |
   ...


Concurrent Transactional Inserts

Files in hdfs

/hawq_data/.../
    0   <- session 1 inserts
    1   <- session 2 inserts
    2
    ...

Catalog metadata

 segno | eof | …
-------+-----+...
     0 | 120 | (mvcc)
     1 | 220 | (mvcc)
     2 |  40 |
   ...


Concurrent Transactional Inserts

Files in hdfs

/hawq_data/.../
    0   <- abort / truncate
    1   <- commit
    2
    ...

Catalog metadata

 segno | eof | …
-------+-----+...
     0 | 100 | (mvcc)
     1 | 220 | (mvcc)
     2 |  40 |
   ...

HAWQ relies on HDFS Truncate support (HDFS-3107) to truncate aborted inserts so that later sessions can insert atomically
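The sequence on these slides can be mimicked with an in-memory model: each session appends to its own file, a commit advances the catalog `eof`, and an abort truncates the file back to the last committed `eof`. This is a simplified sketch under stated assumptions, not HAWQ's implementation. The class names are hypothetical, files are byte arrays rather than HDFS files, and the catalog is updated only at commit instead of through MVCC row versions.

```python
class SegmentFile:
    """In-memory stand-in for one append-only file under /hawq_data/.../."""

    def __init__(self, size: int):
        self.data = bytearray(size)

    def append(self, payload: bytes) -> None:
        self.data.extend(payload)

    def truncate(self, length: int) -> None:
        # Stand-in for an HDFS truncate: drop everything past `length`.
        del self.data[length:]


class Table:
    def __init__(self, eofs: dict):
        # Catalog metadata: segno -> committed logical end of file.
        self.catalog_eof = dict(eofs)
        self.files = {segno: SegmentFile(eof) for segno, eof in eofs.items()}

    def insert(self, segno: int, payload: bytes) -> None:
        """Append past the committed eof; readers do not see this yet."""
        self.files[segno].append(payload)

    def commit(self, segno: int) -> None:
        """Publish the appended bytes by advancing the catalog eof."""
        self.catalog_eof[segno] = len(self.files[segno].data)

    def abort(self, segno: int) -> None:
        """Discard uncommitted bytes by truncating to the catalog eof."""
        self.files[segno].truncate(self.catalog_eof[segno])

    def visible_length(self, segno: int) -> int:
        """Readers only ever scan up to the committed eof."""
        return self.catalog_eof[segno]


table = Table({0: 100, 1: 20, 2: 40})
table.insert(0, b"x" * 20)    # session 1 writes 20 bytes to file 0
table.insert(1, b"y" * 200)   # session 2 writes 200 bytes to file 1
table.abort(0)                # session 1 aborts: file 0 is truncated
table.commit(1)               # session 2 commits: eof of file 1 advances
```

After the abort and commit, the catalog matches the final slide (eof 100, 220, 40), and file 0 is clean for the next session to append to.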


HAWQ Metadata management

Master Catalog

➜ Stores all the system metadata

➜ Based on PostgreSQL style catalog representation

➜ Supports master mirroring for fault tolerance

➜ Provides for fully transactional DDL operations

Query Annotation

➜ Metadata is needed at query execution time on the workers

➜ The most efficient method of providing metadata is to dispatch it with the query

➜ Achieved by walking the plan prior to dispatch and annotating it with query metadata

Local Catalog Cache

➜ Each worker has a native understanding of all bootstrap types

➜ Data dispatched with the query is added to a local cache for the duration of a query.

➜ Each worker is effectively stateless and receives the needed metadata at execution time.
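A minimal sketch of this idea follows: walk the plan on the master, collect the catalog entries for the relations it touches, and hand them to a worker that keeps them only for the duration of the query. The plan node shape, the catalog contents, and the `Worker` class are all hypothetical and far simpler than a real plan or catalog.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical master catalog: relation name -> column name -> type.
MASTER_CATALOG = {
    "a": {"i": "int4"},
    "b": {"j": "int4"},
    "c": {"k": "text"},
}


@dataclass
class PlanNode:
    op: str
    relation: str = ""
    children: List["PlanNode"] = field(default_factory=list)


def collect_metadata(plan: PlanNode, catalog: Dict[str, dict]) -> Dict[str, dict]:
    """Walk the plan and gather catalog entries for every relation it scans."""
    needed = {}
    stack = [plan]
    while stack:
        node = stack.pop()
        if node.relation:
            needed[node.relation] = catalog[node.relation]
        stack.extend(node.children)
    return needed


class Worker:
    """Stateless worker: its catalog cache exists only while a query runs."""

    def execute(self, plan: PlanNode, metadata: Dict[str, dict]) -> List[str]:
        cache = dict(metadata)  # local cache, discarded when the query ends
        # Resolve relations from the cache only, never from the master.
        return sorted(cache)


plan = PlanNode("Hash Join", children=[
    PlanNode("Seq Scan", relation="a"),
    PlanNode("Seq Scan", relation="b"),
])
annotation = collect_metadata(plan, MASTER_CATALOG)
resolved = Worker().execute(plan, annotation)
```

Only the entries for `a` and `b` travel with the query; relation `c` is never sent, which keeps the dispatched payload proportional to the query rather than to the catalog.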


HAWQ Distributed Query Engine

Motion 2 phase aggregation Dispatch

explain select * from a join b on (a.i=b.j);
                     QUERY PLAN
----------------------------------------------------
 Gather Motion 2:1
   ->  Hash Join
         Hash Cond: a.i = b.j
         ->  Seq Scan on a
         ->  Hash
               ->  Redistribute Motion 2:2
                     Hash Key: b.j
                     ->  Seq Scan on b

● GATHER Motion: Data from all nodes is brought to 1 location

● REDISTRIBUTE Motion: Data is hash partitioned between virtual segments

● BROADCAST Motion: Data is broadcast to all virtual segments
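The three motions can be sketched over plain Python lists, with one list of rows per virtual segment. This is an illustration of the data movement only, assuming table `a` is already hash-distributed on `i`. The function names are hypothetical, and the real interconnect streams rows over the network rather than building lists.

```python
from collections import defaultdict


def redistribute(rows_per_segment, key, n_segments):
    """REDISTRIBUTE: hash-partition every row on `key` across the segments."""
    out = [[] for _ in range(n_segments)]
    for rows in rows_per_segment:
        for row in rows:
            out[hash(row[key]) % n_segments].append(row)
    return out


def broadcast(rows_per_segment, n_segments):
    """BROADCAST: every segment receives a full copy of all rows."""
    everything = [row for rows in rows_per_segment for row in rows]
    return [list(everything) for _ in range(n_segments)]


def gather(rows_per_segment):
    """GATHER: bring the rows from all segments to one location."""
    return [row for rows in rows_per_segment for row in rows]


def hash_join(left, right, left_key, right_key):
    """Local hash join executed independently on one segment."""
    table = defaultdict(list)
    for r in right:
        table[r[right_key]].append(r)
    return [{**l, **r} for l in left for r in table.get(l[left_key], [])]


N = 2
# Table a is assumed to be distributed on i; table b is not distributed on j.
a_segments = redistribute([[{"i": 1}, {"i": 2}, {"i": 3}]], "i", N)
b_segments = [[{"j": 1}, {"j": 3}], [{"j": 2}, {"j": 4}]]

# Mirror the plan: redistribute b on j, join locally, then gather.
b_moved = redistribute(b_segments, "j", N)
result = gather([hash_join(a_segments[s], b_moved[s], "i", "j") for s in range(N)])

# Alternative: broadcast b so that a does not need to move at all.
b_copies = broadcast(b_segments, N)
result_bcast = gather([hash_join(a_segments[s], b_copies[s], "i", "j") for s in range(N)])
```

Both strategies produce the same joined rows; they differ in how much data crosses the network, which is the trade-off a planner weighs when choosing between them.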


HAWQ Distributed Query Engine

Motion 2 phase aggregation Pipelines

Join A                   Join B                   Join C
“copartitioned” join     “redistributed” join     “broadcast” join

 GATHER                   GATHER                   GATHER
   |                        |                        |
  Join                     Join                     Join
  /  \                     /  \                     /  \
 A    B                   A   REDISTRIBUTE         A   BROADCAST
                                |                        |
                                B                        B


HAWQ Distributed Query Engine

Motion 2 phase aggregation Pipelines

explain select count(*) from b group by j;
                   QUERY PLAN
-------------------------------------------------
 Gather Motion 2:1
   ->  HashAggregate
         Group By: b.j
         ->  Redistribute Motion 2:2
               Hash Key: b.j
               ->  HashAggregate
                     Group By: b.j
                     ->  Seq Scan on b

● Similar in concept to COMBINE/REDUCE in Hadoop

● Local aggregation occurs on the data processed by each virtual segment

● 2nd phase aggregation occurs after GATHER/REDISTRIBUTE to accumulate partial aggregations from individual virtual segments
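The plan above can be traced with a small Python sketch: each virtual segment counts its own rows, the partial counts are redistributed by group key, and a second aggregation sums them. The sample rows and function names are made up for illustration, and segments are modeled as plain lists.

```python
from collections import Counter


def local_aggregate(rows, key):
    """Phase 1: count rows per group using only this segment's data."""
    return Counter(row[key] for row in rows)


def redistribute_partials(partials, n_segments):
    """Redistribute Motion: send each (group, partial count) to the segment that owns the group."""
    out = [[] for _ in range(n_segments)]
    for partial in partials:
        for group, count in partial.items():
            out[hash(group) % n_segments].append((group, count))
    return out


def final_aggregate(pairs):
    """Phase 2: sum the partial counts that arrived for each group."""
    totals = Counter()
    for group, count in pairs:
        totals[group] += count
    return totals


# Hypothetical contents of table b on two virtual segments.
b_segments = [
    [{"j": 1}, {"j": 1}, {"j": 2}],
    [{"j": 1}, {"j": 2}, {"j": 3}],
]

partials = [local_aggregate(rows, "j") for rows in b_segments]
moved = redistribute_partials(partials, len(b_segments))

result = {}
for pairs in moved:  # Gather Motion: collect each segment's finished groups
    result.update(final_aggregate(pairs))
```

Because each group key hashes to exactly one segment, the gathered results never contain the same group twice, and only one small pair per group per segment crosses the network instead of every input row.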


HAWQ Distributed Query Engine

Motion 2 phase aggregation Pipelines

● Each Executor node operates on a “pull” based model

● Several nodes may be active at any time

● Most nodes are non-blocking

● Optimized such that inactive executor nodes do not occupy resources.
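Python generators give a compact way to picture a pull-based executor: each node produces a row only when its parent asks for one. The sketch below is a generic illustration of that model, with hypothetical node functions; it is not HAWQ's executor code.

```python
from itertools import islice

scanned = []  # records how far the scan was actually pulled


def seq_scan(rows):
    """Leaf node: yields one row each time the parent pulls."""
    for row in rows:
        scanned.append(row)
        yield row


def filter_node(child, predicate):
    """Non-blocking node: passes matching rows through as they arrive."""
    for row in child:
        if predicate(row):
            yield row


def limit_node(child, n):
    """Stops pulling from its child once n rows have been produced."""
    yield from islice(child, n)


plan = limit_node(filter_node(seq_scan(range(100)), lambda r: r % 2 == 0), 3)
result = list(plan)  # the root pulls; work happens only on demand
```

The scan stops after five rows even though the input has one hundred, because nothing above it asks for more. A blocking node such as a sort would instead have to drain its child before yielding its first row.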


Resource Manager Design

Yarn

➜ Provisions containers to Yarn Applications

➜ Provides multi-tenant Resource Management across applications

➜ Support for different scheduling policies

    ○ fair scheduler
    ○ capacity scheduler

HAWQ RM

➜ Requests resources from Yarn when needed

➜ Returns resources to Yarn when unused

➜ Provides Low latency allocation of HAWQ containers to queries

➜ Determines how many resources to allocate to a query

HAWQ Dispatch

➜ Allocates HAWQ virtual segments to a query

➜ Assigns HDFS blocks to HAWQ virtual segments

➜ Allocates resources within Yarn containers to individual chunks of a distributed query plan
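As a rough picture of the block-assignment step, the sketch below spreads HDFS blocks over a fixed number of virtual segments so that each gets a similar amount of data. The greedy least-loaded policy, the block names, and the sizes are assumptions made for illustration; the slides do not describe the actual policy, which would also need to consider factors such as data locality.

```python
import heapq


def assign_blocks(block_sizes, n_virtual_segments):
    """Greedy sketch: give each block, largest first, to the least-loaded segment."""
    # Heap entries are (bytes assigned so far, virtual segment id).
    heap = [(0, seg) for seg in range(n_virtual_segments)]
    heapq.heapify(heap)
    assignment = {seg: [] for seg in range(n_virtual_segments)}
    for block_id, size in sorted(block_sizes.items(), key=lambda kv: -kv[1]):
        load, seg = heapq.heappop(heap)
        assignment[seg].append(block_id)
        heapq.heappush(heap, (load + size, seg))
    return assignment


# Hypothetical blocks of one table, sizes in MB.
blocks = {"blk_0": 128, "blk_1": 128, "blk_2": 64, "blk_3": 64}
assignment = assign_blocks(blocks, 2)
loads = {seg: sum(blocks[b] for b in ids) for seg, ids in assignment.items()}
```

With these inputs both virtual segments end up with 192 MB, so neither becomes the straggler that delays the whole query.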


Resource Manager Design

Yarn HAWQ RM HAWQ Dispatch

[Diagram labels: pxf (repeated), HDFS, hbase, Ambari, Yarn]


Resource Manager Design

Yarn HAWQ RM HAWQ Dispatch

[Diagram labels: Q1 (six boxes), Q2 (one box)]


Resource Manager Design

Yarn HAWQ RM HAWQ Dispatch

[Diagram labels: Q1 (six boxes), HDFS]


HAWQ Extensibility

➜ User Defined Functions

➜ User Defined Aggregates

➜ User Defined Operators

➜ User Defined Types

➜ Supports multiple languages


HAWQ Machine Learning

Apache MADlib (incubating)

➜ Leverages robust extensibility

➜ Provides in-database machine learning capabilities

➜ Supports
    ○ Clustering
    ○ Regression
    ○ Classification
    ○ Topic Modeling
    ○ … and much more

http://madlib.incubator.apache.org/



Website: http://hawq.incubator.apache.org/

Wiki: https://cwiki.apache.org/confluence/display/HAWQ

Github mirror: https://github.com/apache/incubator-hawq/

Bug reporting: https://issues.apache.org/jira/browse/HAWQ

HADOOP NATIVE SQL

Questions?