split query processing in polybase - harvard seasdaslab.seas.harvard.edu/classes/cs265/files/... ·...

Split Query Processing in Polybase Varun Sriram Frederick Widjaja


Page 1: Split Query Processing in Polybase - Harvard SEASdaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Native Hadoop systems Database-Hadoop hybrids Why do we need SQL

Split Query Processing in Polybase

Varun Sriram, Frederick Widjaja


Problem: Querying Data in Multiple Formats

Relational: “Structured”
Distributed File System: “Unstructured”


When do we use each?


In what situations (if ever) do we need both?


Problem: Querying Data in Multiple Formats

“SQL-on-Hadoop”

Native Hadoop systems
Database-Hadoop hybrids


Why do we need SQL to query each?


Existing Solution: EXTERNAL TABLES


Existing Solution: Hadapt


Hadapt: 2 selects and 1 join

[Figure: query plan with labels HDFS, DB, Filter (twice), Join via MapReduce, Join in PostgreSQL]


Polybase: PDW Architecture


Polybase: EXTERNAL TABLES


Polybase: Communicating With HDFS


Polybase: USE CASES


QUERY OPTIMIZATION

SELECT count(*) FROM Customer
WHERE acctbal < 0
GROUP BY nationkey

Table Customer is stored on HDFS
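
One way this query could be pushed down to Hadoop is as a single MapReduce job: the map phase applies the filter and emits one count per qualifying row keyed by nationkey, and the reduce phase sums the counts. The sketch below simulates that plan in plain Python. The sample rows and the names map_phase and reduce_phase are hypothetical, and this is an illustration rather than the code Polybase generates.

```python
from collections import defaultdict

# Hypothetical sample of Customer rows: (nationkey, acctbal).
CUSTOMER = [(1, -10.0), (1, 25.0), (2, -3.5), (2, -7.0), (3, 40.0)]

def map_phase(rows):
    """Apply WHERE acctbal < 0 and emit (nationkey, 1) per qualifying row."""
    for nationkey, acctbal in rows:
        if acctbal < 0:
            yield nationkey, 1

def reduce_phase(pairs):
    """Sum the emitted counts per nationkey (the GROUP BY / count(*))."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

result = reduce_phase(map_phase(CUSTOMER))
print(result)
```

In a plan of this shape, only the per-nation counts would need to travel back to the database, not every Customer row.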



JOIN ON PDW/HDFS

Perform Join with Map-Reduce
Perform Join in PDW


JOIN ON HDFS/HDFS

Perform Join with Map-Reduce
Perform Join in PDW


EXPERIMENT GOALS


Is this the right approach?


EXPERIMENT QUERY 1

SELECT TOP 10 unique1, unique2, unique4, stringu1, stringu2, string4
FROM T1
WHERE (unique1 % 100) < T1-SF
ORDER BY unique1

Table T1 is stored on HDFS
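
The T1-SF placeholder appears to act as a selectivity factor in percent: if unique1 takes each value in 0..N-1 exactly once and N is a multiple of 100, the predicate (unique1 % 100) < T1-SF keeps T1-SF percent of the rows. A minimal sketch of that arithmetic follows. The table size, the swept values, and the helper name selected_fraction are hypothetical and are not taken from the slides.

```python
def selected_fraction(num_rows, sf):
    """Fraction of rows with (unique1 % 100) < sf when unique1 = 0..num_rows-1."""
    kept = sum(1 for unique1 in range(num_rows) if unique1 % 100 < sf)
    return kept / num_rows

# Sweep a few hypothetical selectivity factors.
for sf in (1, 20, 40, 60, 80, 100):
    print(sf, selected_fraction(10_000, sf))
```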


EXPERIMENT QUERY 1 - Results

16 node PDW cluster and 48 node Hadoop cluster (C-16/48)

30 node PDW cluster and 30 node Hadoop cluster (C-30/30)

60 node PDW cluster and 60 node Hadoop cluster co-located on the same nodes (C60)


EXPERIMENT QUERY 2

SELECT TOP 10 T1.unique1, T1.unique2, T2.unique3, T2.stringu1, T2.stringu2
FROM T1 INNER JOIN T2 ON (T1.unique1 = T2.unique2)
WHERE T1.onePercent < T1-SF AND T2.onePercent < T2-SF
ORDER BY T1.unique2

“Independent” join of T1 and T2



EXPERIMENT QUERY 2 - Results

C-16/48 C-30/30 C60



EXPERIMENT QUERY 3

SELECT TOP 10 T1.unique1, T1.unique2, T2.unique3, T2.stringu1, T2.stringu2
FROM T1 INNER JOIN T2 ON (T1.unique1 = T2.unique1)
WHERE T1.onePercent < T1-SF AND T2.onePercent < T2-SF
ORDER BY T1.unique2

“Correlated” join of T1 and T2
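
The labels “independent” and “correlated” can be illustrated with a small simulation. It assumes that unique1 is a random permutation of 0..N-1, that unique2 is sequential, and that onePercent equals unique1 % 100, as in the Wisconsin benchmark schema that these column names resemble. None of this is stated on the slides. Under those assumptions, joining on T1.unique1 = T2.unique1 makes both filters test the same value, so min(T1-SF, T2-SF) percent of rows survive, while joining on T1.unique1 = T2.unique2 makes the two filters roughly independent. The names make_table and join_count are hypothetical.

```python
import random

def make_table(n, seed):
    """Hypothetical generator: unique1 is a random permutation, unique2 is sequential."""
    rng = random.Random(seed)
    unique1 = list(range(n))
    rng.shuffle(unique1)
    return [{"unique1": u1, "unique2": u2, "onePercent": u1 % 100}
            for u2, u1 in enumerate(unique1)]

def join_count(t1, t2, t2_key, sf1, sf2):
    """Count rows of T1 JOIN T2 ON T1.unique1 = T2.<t2_key> with both SF filters."""
    # Index the T2 rows that pass their filter by the join column.
    right = {row[t2_key]: row for row in t2 if row["onePercent"] < sf2}
    return sum(1 for row in t1
               if row["onePercent"] < sf1 and row["unique1"] in right)

N, SF1, SF2 = 1000, 30, 60  # assumed table size and selectivity factors
t1, t2 = make_table(N, seed=1), make_table(N, seed=2)
independent = join_count(t1, t2, "unique2", SF1, SF2)  # Query 2 style join
correlated = join_count(t1, t2, "unique1", SF1, SF2)   # Query 3 style join
print(independent, correlated)
```

With these inputs the correlated join keeps every T1 row that passes the tighter filter, while the independent join keeps only the subset of those rows whose matching T2 row also passes its own filter.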



EXPERIMENT QUERY 3 - Results

C-16/48 C-30/30 C60


NEXT STEPS


● Realistic workload experiments comparing to other versions of database/Hadoop hybrid systems

● More investigation into optimal cost-based query optimizers, and what factors should go into them