split query processing in polybase - harvard seasdaslab.seas.harvard.edu/classes/cs265/files/... ·...
Split Query Processing in Polybase
Varun Sriram, Frederick Widjaja
Problem: Querying Data in Multiple Formats
Relational: “Structured”
Distributed File System: “Unstructured”
When do we use each?
In what situations (if ever) do we need both?
Problem: Querying Data in Multiple Formats
“SQL-on-Hadoop”
Native Hadoop systems
Database-Hadoop hybrids
Why do we need SQL to query each?
Existing Solution: EXTERNAL TABLES
Existing Solution: Hadapt
Hadapt: 2 selects and 1 join
[Diagram: HDFS and DB inputs, each with a Filter; Join via MapReduce or Join in PostgreSQL]
Polybase: PDW Architecture
Polybase: EXTERNAL TABLES
Polybase: Communicating With HDFS
Polybase: USE CASES
QUERY OPTIMIZATION
SELECT count(*) FROM Customer
WHERE acctbal < 0
GROUP BY nationkey
Table Customer is stored on HDFS
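The choice an optimizer faces for this query can be illustrated with a small sketch. The Python below is not Polybase code; the sample rows, the function names, and the use of "rows shipped" as the only cost are all assumptions made for illustration. It evaluates the query two ways: filtering and aggregating where the data lives and shipping only the per-group counts, or shipping every row and doing the work on the database side.

```python
from collections import Counter

# Hypothetical sample of the Customer table as it might sit on HDFS.
customer = [
    {"nationkey": 1, "acctbal": -50.0},
    {"nationkey": 1, "acctbal": 120.0},
    {"nationkey": 2, "acctbal": -10.0},
    {"nationkey": 2, "acctbal": -75.5},
    {"nationkey": 3, "acctbal": 300.0},
]

def count_negative_by_nation(rows):
    """SELECT count(*) ... WHERE acctbal < 0 GROUP BY nationkey."""
    return dict(Counter(r["nationkey"] for r in rows if r["acctbal"] < 0))

def plan_push_down(rows):
    """Filter and aggregate on the Hadoop side; ship only the group counts."""
    result = count_negative_by_nation(rows)
    return result, len(result)

def plan_import(rows):
    """Ship every row to the database, then filter and aggregate there."""
    shipped = list(rows)
    return count_negative_by_nation(shipped), len(shipped)

pushed_result, pushed_rows = plan_push_down(customer)
imported_result, imported_rows = plan_import(customer)
print(pushed_result, pushed_rows, imported_rows)
```

Both plans return the same counts; they differ in how much data crosses the boundary between the two systems. A real optimizer would weigh more than row counts (for example, the startup cost of a MapReduce job), which this sketch ignores.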
JOIN ON PDW/HDFS
Perform Join with Map-Reduce
Perform Join in PDW
JOIN ON HDFS/HDFS
Perform Join with Map-Reduce
Perform Join in PDW
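One common way to perform a join with MapReduce is a repartition (reduce-side) join: the map phase tags each row with its source table and emits it under the join key, and the reduce phase pairs up rows from the two tables that share a key. The sketch below simulates that in plain Python. It illustrates the general technique, not the plan Polybase generates; the table contents and function names are hypothetical.

```python
from collections import defaultdict

def map_phase(table_name, rows, key):
    """Emit (join key, (source tag, row)) pairs, as a mapper would."""
    for row in rows:
        yield row[key], (table_name, row)

def shuffle(pairs):
    """Group emitted pairs by key, standing in for the shuffle step."""
    groups = defaultdict(list)
    for k, tagged in pairs:
        groups[k].append(tagged)
    return groups

def reduce_phase(groups, left_name, right_name):
    """For each key, pair every left row with every right row."""
    joined = []
    for k in sorted(groups):
        left = [r for tag, r in groups[k] if tag == left_name]
        right = [r for tag, r in groups[k] if tag == right_name]
        joined.extend({**l, **r} for l in left for r in right)
    return joined

# Hypothetical inputs: two small tables sharing the column "id".
t1 = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}, {"id": 3, "a": "z"}]
t2 = [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}, {"id": 4, "b": "r"}]

pairs = list(map_phase("T1", t1, "id")) + list(map_phase("T2", t2, "id"))
mr_join_result = reduce_phase(shuffle(pairs), "T1", "T2")
print(mr_join_result)
```

Performing the join in PDW instead would mean importing the (possibly filtered) HDFS rows first and letting the database engine do the matching.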
EXPERIMENT GOALS
Is this the right approach?
EXPERIMENT QUERY 1
SELECT TOP 10 unique1, unique2, unique4, stringu1, stringu2, string4
FROM T1
WHERE (unique1 % 100) < T1-SF
ORDER BY unique1
Table T1 is stored on HDFS
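The T1-SF term in the WHERE clause reads as a selectivity-factor parameter rather than a column. Assuming unique1 takes each value from 0 to N-1 exactly once (as in Wisconsin benchmark tables; that assumption and the table size below are not stated on the slide), the predicate keeps T1-SF percent of the rows, which the following sketch checks:

```python
def selected_fraction(n_rows, sf):
    """Fraction of rows with (unique1 % 100) < sf when unique1 is 0..n_rows-1."""
    kept = sum(1 for unique1 in range(n_rows) if unique1 % 100 < sf)
    return kept / n_rows

# Hypothetical table size; any multiple of 100 gives exact percentages.
fractions = {sf: selected_fraction(10_000, sf) for sf in (1, 30, 60, 90)}
print(fractions)
```

Varying T1-SF therefore controls how much of T1 survives the filter, and so how much data has to move between the two systems.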
EXPERIMENT QUERY 1 - Results
16-node PDW cluster and 48-node Hadoop cluster (C-16/48)
30-node PDW cluster and 30-node Hadoop cluster (C-30/30)
60-node PDW cluster and 60-node Hadoop cluster co-located on the same nodes (C60)
EXPERIMENT QUERY 2
SELECT TOP 10 T1.unique1, T1.unique2, T2.unique3, T2.stringu1, T2.stringu2
FROM T1 INNER JOIN T2 ON (T1.unique1 = T2.unique2)
WHERE T1.onePercent < T1-SF AND T2.onePercent < T2-SF
ORDER BY T1.unique2
“Independent” join of T1 and T2
EXPERIMENT QUERY 2 - Results
C-16/48 C-30/30 C60
EXPERIMENT QUERY 3
SELECT TOP 10 T1.unique1, T1.unique2, T2.unique3, T2.stringu1, T2.stringu2
FROM T1 INNER JOIN T2 ON (T1.unique1 = T2.unique1)
WHERE T1.onePercent < T1-SF AND T2.onePercent < T2-SF
ORDER BY T1.unique2
“Correlated” join of T1 and T2
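The difference between the "independent" join of Query 2 and the "correlated" join of Query 3 shows up in how many rows survive both filters and the join. The sketch below assumes a Wisconsin-benchmark-style layout that the slides do not spell out: unique1 is a random permutation of 0..N-1, unique2 is sequential, and onePercent equals unique1 % 100. The table size, seeds, and selectivity factors are hypothetical.

```python
import random

def make_table(n_rows, seed):
    """Wisconsin-style rows: unique1 shuffled, unique2 sequential (assumed layout)."""
    rng = random.Random(seed)
    unique1 = list(range(n_rows))
    rng.shuffle(unique1)
    return [
        {"unique1": u1, "unique2": u2, "onePercent": u1 % 100}
        for u2, u1 in enumerate(unique1)
    ]

def join_count(t1, t2, t2_key, sf1, sf2):
    """Count rows of T1 JOIN T2 ON T1.unique1 = T2.<t2_key> after both filters."""
    # Both candidate key columns are unique, so each T1 row matches at most once.
    t2_keys = {r[t2_key] for r in t2 if r["onePercent"] < sf2}
    return sum(1 for r in t1 if r["onePercent"] < sf1 and r["unique1"] in t2_keys)

N, SF1, SF2 = 10_000, 30, 30
t1 = make_table(N, seed=1)
t2 = make_table(N, seed=2)

independent = join_count(t1, t2, "unique2", SF1, SF2)  # Query 2 join condition
correlated = join_count(t1, t2, "unique1", SF1, SF2)   # Query 3 join condition
print(independent, correlated)
```

Under these assumptions the correlated join keeps about min(SF1, SF2) percent of the rows, because matching rows share the same onePercent value, while the independent join keeps roughly SF1 × SF2 / 100 percent. The same selectivity settings can therefore produce very different join output sizes.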
EXPERIMENT QUERY 3 - Results
C-16/48 C-30/30 C60
NEXT STEPS
● Realistic workload experiments comparing to other versions of database/Hadoop hybrid systems
● More investigation into optimal cost-based query optimizers, and what factors should go into them