sql on hadoop cmsc 491 hadoop-based distributed computing spring 2015 adam shook

SQL on Hadoop

CMSC 491
Hadoop-Based Distributed Computing

Spring 2015
Adam Shook

ALL OF THESE
But HAWQ specifically
Cause that's what I know

Problem!

• “MapReduce is great, but all of my data dudes don’t know Java”

• Well, Pig and Hive exist... They are kind of SQL

• “But Pig and Hive are slow and they aren’t really SQL... How can I efficiently use all of my SQL scripts that I have today?”

• Well, that's why all these companies are building SQL on Hadoop engines... Like HAWQ.

SQL Engines for Hadoop

• Massively Parallel Processing (MPP) frameworks to run SQL queries against data stored in HDFS

• Not MapReduce, but still brings the code to the data

• SQL for big data sets, but not stupid huge ones

• Stupid huge ones should still use MapReduce

Current SQL Landscape

• Apache Drill (MapR)
• Cloudera Impala
• Facebook Presto
• Hive Stinger (Hortonworks)
• Pivotal HAWQ
• Shark – Hive on Spark (Berkeley)

Why?

• Ability to execute complex multi-staged queries in-memory against structured data

• Available SQL-based machine learning libraries can be ported to work on the system

• A well-known and common query language to express data crunching algorithms

• Not all queries need to run for hours on end and be super fault tolerant

Okay, tell me more...

• Many visualization and ETL tools speak SQL, and today need some hacked-up variant to work with HiveQL

• Can now connect these tools and legacy applications to “big data” stored in HDFS

• You can start leveraging Hadoop with what you know and begin to explore other Hadoop ecosystem projects

• Your Excuse Here

SQL on Hadoop

• Built for analytics!
– OLAP vs OLTP

• Large I/O queries against append-only tables

• Write-once, read-many much like MapReduce

• Intent is to retrieve results and run deep analytics in ~20 minutes

• Anything longer, you may want to consider using MapReduce

Architectures

• Architectures are all very similar

[Diagram: a Master node containing the Query Planner, four Query Executor nodes, and HDFS]
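The master/executor split in the diagram can be sketched in a few lines of Python. This is only a toy under stated assumptions: `SLICES`, `plan`, and `run` are hypothetical names, threads stand in for executor nodes, and in-memory lists stand in for data in HDFS. It is not how any particular engine is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each executor owns one slice of a table stored in HDFS.
SLICES = [
    [3, 8, 1],
    [7, 2],
    [9, 4, 6],
    [5],
]

def plan(query):
    """Master-side planner: map a query name to (per-slice step, combine step)."""
    plans = {
        "sum": (sum, sum),      # partial sums, then sum of partials
        "count": (len, sum),    # partial counts, then sum of partials
        "max": (max, max),      # partial maxima, then max of partials
    }
    return plans[query]

def run(query):
    """Plan on the master, execute on every slice in parallel, combine results."""
    partial_fn, combine_fn = plan(query)
    # One worker per slice stands in for one query executor per node.
    with ThreadPoolExecutor(max_workers=len(SLICES)) as pool:
        partials = list(pool.map(partial_fn, SLICES))
    return combine_fn(partials)

if __name__ == "__main__":
    print(run("sum"), run("count"), run("max"))
```

The point of the sketch is that the per-slice step runs where the data lives, and only small partial results travel back to the master.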

Basic HAWQ Architecture

Parallel Query Optimizer

• Cost-based optimization looks for the most efficient plan

• Physical plan contains scans, joins, sorts, aggregations, etc.

• Directly inserts ‘motion’ nodes for inter-segment communication
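To make "cost-based" concrete, here is a toy planner that picks how to move data for a join by comparing estimated network traffic. The cost formulas are an assumption made for illustration (rows shipped between segments only); a real optimizer such as HAWQ's weighs many more factors, and the function names are hypothetical.

```python
def broadcast_cost(rows_small, num_segments):
    """Assumed cost: every row of the smaller table is copied to every other segment."""
    return rows_small * (num_segments - 1)

def redistribute_cost(rows_a, rows_b, num_segments):
    """Assumed cost: both tables are rehashed, so about (n-1)/n of all rows move."""
    return (rows_a + rows_b) * (num_segments - 1) / num_segments

def choose_motion(rows_a, rows_b, num_segments):
    """Pick the cheaper data movement strategy under the toy cost model."""
    b_cost = broadcast_cost(min(rows_a, rows_b), num_segments)
    r_cost = redistribute_cost(rows_a, rows_b, num_segments)
    return "broadcast" if b_cost <= r_cost else "redistribute"

if __name__ == "__main__":
    # Tiny dimension table joined to a large fact table.
    print(choose_motion(100, 1_000_000, 4))
    # Two tables of similar size.
    print(choose_motion(1_000, 1_000, 4))
```

Under this model a very small table is cheaper to copy everywhere, while two similarly sized tables are cheaper to rehash.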

Parallel Query Optimizer Continued

• Inserts motion nodes for efficient non-local join processing (assume table A is distributed across all segments, i.e. each segment K has its own slice A_K)

– Broadcast Motion (N:N)

• Every segment sends A_K to all other segments

– Redistribute Motion (N:N)

• Every segment rehashes A_K (by join column) and redistributes each row

– Gather Motion (N:1)

• Every segment sends its A_K to a single node (usually the master)
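The three motions can be simulated on plain Python lists, where each inner list is one segment's slice of table A. This is a sketch, not HAWQ code: the function names, the sample rows, and the use of `key % n` as the hash function are all assumptions chosen to keep the example deterministic.

```python
def broadcast(segments):
    """Broadcast motion (N:N): afterwards every segment holds every row."""
    everything = [row for seg in segments for row in seg]
    return [list(everything) for _ in segments]

def redistribute(segments, key):
    """Redistribute motion (N:N): rehash each row on the join column."""
    n = len(segments)
    out = [[] for _ in range(n)]
    for seg in segments:
        for row in seg:
            # Assumed hash: integer key modulo the number of segments.
            out[row[key] % n].append(row)
    return out

def gather(segments):
    """Gather motion (N:1): all rows are sent to a single node."""
    return [row for seg in segments for row in seg]

# Hypothetical table A, spread over three segments.
A = [
    [{"id": 1, "cust": 10}, {"id": 2, "cust": 11}],
    [{"id": 3, "cust": 12}],
    [{"id": 4, "cust": 10}],
]

if __name__ == "__main__":
    print(broadcast(A))
    print(redistribute(A, "cust"))
    print(gather(A))
```

After `redistribute(A, "cust")`, both rows with `cust == 10` sit on the same segment, which is what lets a join on that column proceed locally.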

Parallel Query Optimization Example

SELECT c_custkey, c_name,
       sum(l_extendedprice * (1 - l_discount)) as revenue,
       c_acctbal, n_name, c_address, c_phone, c_comment

FROM customer, orders, lineitem, nation

WHERE c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate >= date '1994-08-01'
  and o_orderdate < date '1994-08-01' + interval '3 month'
  and l_returnflag = 'R'
  and c_nationkey = n_nationkey

GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment

ORDER BY revenue desc
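To see what this query computes, a trimmed version can be run against a few made-up rows using Python's built-in `sqlite3` module. Everything here is an assumption for illustration: the sample data is invented, several customer columns are dropped for brevity, and because SQLite has no `date '...' + interval` syntax the upper date bound is written out as the literal `'1994-11-01'`. It runs on a single machine, so it shows the result of the query, not the parallel plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nation   (n_nationkey INTEGER, n_name TEXT);
CREATE TABLE customer (c_custkey INTEGER, c_name TEXT, c_nationkey INTEGER);
CREATE TABLE orders   (o_orderkey INTEGER, o_custkey INTEGER, o_orderdate TEXT);
CREATE TABLE lineitem (l_orderkey INTEGER, l_extendedprice REAL,
                       l_discount REAL, l_returnflag TEXT);
INSERT INTO nation   VALUES (1, 'CANADA');
INSERT INTO customer VALUES (10, 'alice', 1), (11, 'bob', 1);
INSERT INTO orders   VALUES (100, 10, '1994-08-15'),
                            (101, 11, '1994-09-01'),
                            (102, 10, '1995-01-01');
INSERT INTO lineitem VALUES (100, 100.0, 0.5,  'R'),
                            (101, 400.0, 0.25, 'R'),
                            (101, 50.0,  0.0,  'N'),
                            (102, 999.0, 0.0,  'R');
""")

# Trimmed form of the slide's query; ISO date strings compare correctly as text.
QUERY = """
SELECT c_custkey, c_name,
       sum(l_extendedprice * (1 - l_discount)) AS revenue, n_name
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND o_orderdate >= '1994-08-01'
  AND o_orderdate <  '1994-11-01'
  AND l_returnflag = 'R'
  AND c_nationkey = n_nationkey
GROUP BY c_custkey, c_name, n_name
ORDER BY revenue DESC
"""

rows = conn.execute(QUERY).fetchall()

if __name__ == "__main__":
    for row in rows:
        print(row)
```

Only returned line items (`l_returnflag = 'R'`) on orders inside the three-month window count toward each customer's revenue.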

Data Distributions

• Every table has a distribution method

• DISTRIBUTED BY (column)
– Uses a hash distribution

• DISTRIBUTED RANDOMLY
– Uses a random distribution which is not guaranteed to provide a perfectly even distribution
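The difference between the two methods can be sketched by assigning 1000 keys to 4 segments both ways. This is a simulation under assumptions, not the engine's actual hashing: `hash_segment` uses `key % NUM_SEGMENTS` as a stand-in hash, and `random_segment` uses a seeded `random.Random` so runs are repeatable.

```python
import random
from collections import Counter

NUM_SEGMENTS = 4  # assumed cluster size for the example

def hash_segment(key):
    """DISTRIBUTED BY (column): the same key always lands on the same segment."""
    return key % NUM_SEGMENTS  # stand-in for a real hash function

def random_segment(rng):
    """DISTRIBUTED RANDOMLY: each row goes to an arbitrary segment."""
    return rng.randrange(NUM_SEGMENTS)

keys = list(range(1000))

hash_counts = Counter(hash_segment(k) for k in keys)

rng = random.Random(0)
random_counts = Counter(random_segment(rng) for _ in keys)

if __name__ == "__main__":
    print("hash:  ", sorted(hash_counts.items()))
    print("random:", sorted(random_counts.items()))
```

With sequential keys the hash counts come out exactly even, while the random counts only approximate an even split. Hash distribution also keeps equal keys together, which matters when joining on the distribution column.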

Multi-Level Partitioning

Use Hash Distribution to evenly spread data across all nodes

Use Range Partition within a node to minimize scan work

[Figure: Segment 1A, Segment 1B, Segment 1C, Segment 1D, each range-partitioned by month from Jan 2007 through Dec 2007]
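The two levels can be sketched together: a hash picks the segment, then the order month picks the partition inside that segment. The names (`storage`, `insert`, `total_for_month`), the sample rows, and the modulo hash are hypothetical; the sketch only illustrates why a query for one month touches a small fraction of the data.

```python
from collections import defaultdict
from datetime import date

NUM_SEGMENTS = 4  # assumed number of segments

# One dict per segment, mapping a (year, month) partition to its rows.
storage = [defaultdict(list) for _ in range(NUM_SEGMENTS)]

def insert(order_id, order_date, amount):
    """Level 1: hash on order_id picks the segment. Level 2: month picks the partition."""
    seg = order_id % NUM_SEGMENTS  # stand-in hash
    partition = (order_date.year, order_date.month)
    storage[seg][partition].append((order_id, order_date, amount))

def total_for_month(year, month):
    """Scan only the matching partition on each segment (partition pruning)."""
    total, scanned = 0.0, 0
    for seg in storage:
        rows = seg.get((year, month), [])
        scanned += len(rows)
        total += sum(amount for _, _, amount in rows)
    return total, scanned

# Made-up data: 120 orders, 10 per month of 2007.
for i in range(120):
    insert(i, date(2007, (i // 10) % 12 + 1, 1), 10.0)

if __name__ == "__main__":
    print(total_for_month(2007, 3))
```

Every segment still takes part in the query, but each one reads only its March 2007 partition instead of all twelve months.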

References

• Apache Drill
• Cloudera Impala: http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html
• Facebook Presto: http://prestodb.io/
• Hive Stinger: http://hortonworks.com/labs/stinger/
• Pivotal HAWQ: http://www.gopivotal.com/big-data/pivotal-hd
• Shark: http://shark.cs.berkeley.edu/
