SQL on Hadoop CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook


SQL on Hadoop

CMSC 491

Hadoop-Based Distributed Computing

Spring 2015

Adam Shook


ALL OF THESE

But HAWQ specifically

'Cause that's what I know


Problem!

• “MapReduce is great, but all of my data dudes don’t know Java”

• Well, Pig and Hive exist... They are kind of SQL

• “But Pig and Hive are slow and they aren’t really SQL... How can I efficiently use all of my SQL scripts that I have today?”

• Well, that's why all these companies are building SQL on Hadoop engines... Like HAWQ.


SQL Engines for Hadoop

• Massively Parallel Processing (MPP) frameworks to run SQL queries against data stored in HDFS

• Not MapReduce, but still brings the code to the data

• SQL for big data sets, but not stupid huge ones

• Stupid huge ones should still use MapReduce


Current SQL Landscape

• Apache Drill (MapR)

• Cloudera Impala

• Facebook Presto

• Hive Stinger (Hortonworks)

• Pivotal HAWQ

• Shark – Hive on Spark (Berkeley)


Why?

• Ability to execute complex multi-staged queries in-memory against structured data

• Available SQL-based machine learning libraries can be ported to work on the system

• A well-known and common query language to express data crunching algorithms

• Not all queries need to run for hours on end and be super fault tolerant


Okay, tell me more...

• Many visualization and ETL tools speak SQL, and need some hacked-up workaround to talk to HiveQL

• Can now connect these tools and legacy applications to “big data” stored in HDFS

• You can start leveraging Hadoop with what you know and begin to explore other Hadoop ecosystem projects

• Your Excuse Here


SQL on Hadoop

• Built for analytics!

– OLAP vs OLTP

• Large I/O queries against append-only tables

• Write-once, read-many, much like MapReduce

• Intent is to retrieve results and run deep analytics in ~20 minutes

• Anything longer, you may want to consider using MapReduce


Architectures

• Architectures are all very similar

[Diagram: Master with Query Planner; four Query Executors; HDFS]
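
The shared layout can be sketched as a scatter-gather loop. This is a toy illustration only, not how HAWQ or any other engine is implemented: `BLOCKS`, `plan_query`, `execute_fragment`, and `run_query` are hypothetical names, and an in-memory dict stands in for HDFS blocks local to each executor.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for HDFS: each executor owns the data stored next to it.
BLOCKS = {
    "executor-1": [3, 8, 15],
    "executor-2": [4, 23],
    "executor-3": [42, 16, 1],
    "executor-4": [7],
}

def plan_query(threshold):
    """Master/planner: split 'SELECT count(*), sum(x) WHERE x > threshold'
    into one fragment per executor."""
    return [(name, threshold) for name in BLOCKS]

def execute_fragment(fragment):
    """Executor: scan only local data and return a partial aggregate."""
    name, threshold = fragment
    rows = [x for x in BLOCKS[name] if x > threshold]
    return len(rows), sum(rows)

def run_query(threshold):
    fragments = plan_query(threshold)
    # Fragments run in parallel; the code goes to the data, not the reverse.
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        partials = list(pool.map(execute_fragment, fragments))
    # Master merges the partial aggregates into the final answer.
    return sum(c for c, _ in partials), sum(s for _, s in partials)

print(run_query(5))  # (6, 111)
```

Only the small partial aggregates travel back to the master; the raw rows stay where they are stored.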


Basic HAWQ Architecture


Parallel Query Optimizer

• Cost-based optimization looks for the most efficient plan

• Physical plan contains scans, joins, sorts, aggregations, etc.

• Directly inserts ‘motion’ nodes for inter-segment communication
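
To make "cost-based" concrete, here is a toy sketch of picking a motion for a join whose inputs are not already co-located. The cost model (count the rows that cross the network) and every function name are assumptions for illustration; the real optimizer uses far richer statistics.

```python
def broadcast_cost(rows, segments):
    # Assumed model: every row of the broadcast table is copied
    # to each of the other segments.
    return rows * (segments - 1)

def redistribute_cost(rows, segments):
    # Assumed model: a rehashed row lands on a different segment
    # with probability (segments - 1) / segments.
    return rows * (segments - 1) / segments

def choose_join_motion(rows_a, rows_b, segments):
    """Return the cheaper candidate plan and its estimated cost."""
    small = min(rows_a, rows_b)
    plans = {
        "broadcast smaller table": broadcast_cost(small, segments),
        "redistribute both tables": (
            redistribute_cost(rows_a, segments)
            + redistribute_cost(rows_b, segments)
        ),
    }
    best = min(plans, key=plans.get)
    return best, plans[best]

# A tiny dimension table joined to a big fact table: broadcasting wins.
print(choose_join_motion(1_000_000, 100, 4))
# Two large tables: rehashing both is cheaper than copying one everywhere.
print(choose_join_motion(1_000_000, 900_000, 4))
```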


Parallel Query Optimizer Continued

• Inserts motion nodes for efficient non-local join processing

(Assume table A is distributed across all segments – i.e. each segment K holds its own slice, A_K)

– Broadcast Motion (N:N)

• Every segment sends A_K to all other segments

– Redistribute Motion (N:N)

• Every segment rehashes A_K (by join column) and redistributes each row

– Gather Motion (N:1)

• Every segment sends its A_K to a single node (usually the master)
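
The three motions can be sketched on in-memory lists, where each inner list is one segment's slice of table A. This is a minimal illustration under assumed names (`broadcast`, `redistribute`, `gather`); it uses Python's built-in `hash` on integer keys rather than whatever hash function HAWQ really applies.

```python
def broadcast(segments):
    """N:N - every segment ends up holding every segment's rows."""
    everything = [row for seg in segments for row in seg]
    return [list(everything) for _ in segments]

def redistribute(segments, key):
    """N:N - rehash each row on the join column and send it to its new owner."""
    out = [[] for _ in segments]
    for seg in segments:
        for row in seg:
            out[hash(row[key]) % len(segments)].append(row)
    return out

def gather(segments):
    """N:1 - every segment sends its rows to a single node (e.g. the master)."""
    return [row for seg in segments for row in seg]

# Three segments, each with its own slice of a table keyed by custkey.
table_a = [
    [{"custkey": 1}, {"custkey": 5}],
    [{"custkey": 4}, {"custkey": 1}],
    [{"custkey": 3}],
]

print(redistribute(table_a, "custkey"))
```

After the redistribute, both rows with `custkey` 1 sit on the same segment, so a join on that column can proceed locally.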


Parallel Query Optimization Example

SELECT c_custkey, c_name,
       sum(l_extendedprice * (1 - l_discount)) as revenue,
       c_acctbal, n_name, c_address, c_phone, c_comment

FROM customer, orders, lineitem, nation

WHERE c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate >= date '1994-08-01'
  and o_orderdate < date '1994-08-01' + interval '3 month'
  and l_returnflag = 'R'
  and c_nationkey = n_nationkey

GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment

ORDER BY revenue desc


Data Distributions

• Every table has a distribution method

• DISTRIBUTED BY (column)

– Uses a hash distribution

• DISTRIBUTED RANDOMLY

– Uses a random distribution which is not guaranteed to provide a perfectly even distribution
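
The difference between the two methods can be sketched in a few lines. This is an illustration only: `distribute_by` and `distribute_randomly` are hypothetical helpers, and CRC32 is an assumed stand-in for the engine's real hash function.

```python
import random
import zlib

def distribute_by(rows, column, num_segments):
    """Hash distribution: the target segment is a pure function of the column value."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        h = zlib.crc32(str(row[column]).encode())  # assumed hash, for illustration
        segments[h % num_segments].append(row)
    return segments

def distribute_randomly(rows, num_segments, seed=None):
    """Random distribution: each row goes to a randomly chosen segment."""
    rng = random.Random(seed)
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[rng.randrange(num_segments)].append(row)
    return segments

rows = [{"custkey": i % 7} for i in range(70)]
print([len(s) for s in distribute_by(rows, "custkey", 4)])
print([len(s) for s in distribute_randomly(rows, 4)])
```

With the hash method, rows sharing a key always land on the same segment, which is what can let a join on that column run without a motion. The random method gives no such guarantee, and its segment sizes are only roughly equal.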


Multi-Level Partitioning

Use Hash Distribution to evenly spread data across all nodes

Use Range Partition within a node to minimize scan work

[Diagram: segments 1A to 1D, 2A to 2D, and 3A to 3D, each holding monthly range partitions from Jan 2007 through Dec 2007]
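
A rough sketch of the two levels working together: hash on one column picks the segment, and a monthly range on a date column picks the partition inside it. All names here (`segment_for`, `partition_for`, `load`, `scan_month`) and the CRC32 hash are assumptions for illustration, not HAWQ internals.

```python
from datetime import date
import zlib

NUM_SEGMENTS = 3

def segment_for(order_id):
    # Level 1: hash distribution spreads rows across segments.
    return zlib.crc32(str(order_id).encode()) % NUM_SEGMENTS

def partition_for(d):
    # Level 2: range partition by calendar month within a segment.
    return (d.year, d.month)

def load(rows):
    # storage[segment][(year, month)] -> list of rows
    storage = [dict() for _ in range(NUM_SEGMENTS)]
    for row in rows:
        seg = storage[segment_for(row["order_id"])]
        seg.setdefault(partition_for(row["order_date"]), []).append(row)
    return storage

def scan_month(storage, year, month):
    """Return the matching rows and how many rows were actually scanned."""
    hits, scanned = [], 0
    for seg in storage:
        part = seg.get((year, month), [])  # every other partition is skipped
        scanned += len(part)
        hits.extend(part)
    return hits, scanned

rows = [
    {"order_id": i, "order_date": date(2007, 1 + i % 12, 1 + i % 28)}
    for i in range(120)
]
storage = load(rows)
hits, scanned = scan_month(storage, 2007, 3)
print(len(hits), scanned)  # 10 10
```

Every segment still takes part in the query, but each one reads only its March 2007 partition, so 10 rows are scanned instead of all 120.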