Experimental Evidence on Partitioning in Parallel Data Warehouses
Pedro Furtado
Prof. at Univ. of Coimbra & Researcher at CISUC
DEI/CISUC-Universidade de Coimbra
Portugal
Pedro Furtado, DOLAP 2004
Context
• Parallelism used for major performance improvement in large Data warehouses
• Using simple low-cost shared-nothing architecture
– Without any efficiency requirements on Network or Nodes
NODE PARTITIONED DATA WAREHOUSE
• Minimize inter-node data exchange requirements
– Horizontally fully-partition facts (largest), rest of relations are replicated
• Hope to obtain near-to-linear speedup
“Divide to conquer” … to run it n times faster
- Horizontally Partition Large Facts (randomly) into n Nodes
- Replicate other Relations (Small Dimensions?)
[Figure: Node 1, Node 2, Node 3, each holding Sales together with dimensions D1, D2, D3, D4]
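The placement on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the `Sales`/`D1` rows, and the fixed seed are assumptions made for the example, not part of the talk.

```python
import random

def partition_randomly(fact_rows, n_nodes, seed=0):
    """Assign every fact row to exactly one of n_nodes fragments at random."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    fragments = [[] for _ in range(n_nodes)]
    for row in fact_rows:
        fragments[rng.randrange(n_nodes)].append(row)
    return fragments

def build_nodes(fact_rows, dimensions, n_nodes, seed=0):
    """Each node stores one fact fragment plus a full copy of every dimension."""
    fragments = partition_randomly(fact_rows, n_nodes, seed)
    return [
        {"Sales": fragment, **{name: list(rows) for name, rows in dimensions.items()}}
        for fragment in fragments
    ]

# Hypothetical toy data: 12 sales rows and one small dimension D1.
sales = [{"sale_id": i, "d1_key": i % 4, "amount": 10 * i} for i in range(12)]
dims = {"D1": [{"d1_key": k, "label": f"d{k}"} for k in range(4)]}
nodes = build_nodes(sales, dims, n_nodes=3)
```

Every sales row lands on exactly one node, while `D1` is stored in full on all of them, which is why a node can join its fragment with any dimension without contacting other nodes.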
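Why Replicate Dimensions?
• We replicated because we would not need to repartition
$$\text{Fact} \bowtie_{A_1} R_1 \bowtie_{A_2} R_2 \cdots \bowtie_{A_n} R_n = \bigcup_{j \in \text{all\_nodes}} \left( \text{Fact}_j \bowtie_{A_1} R_1 \bowtie_{A_2} R_2 \cdots \bowtie_{A_n} R_n \right)$$
…and you can do other ops independently as well
Wouldn't work with partitioned dimensions:
$$\text{Fact} \bowtie_{A_1} R_1 \bowtie_{A_2} R_2 \cdots \bowtie_{A_n} R_n \neq \bigcup_{j \in \text{all\_nodes}} \left( \text{Fact}_j \bowtie_{A_1} R_{1j} \bowtie_{A_2} R_{2j} \cdots \bowtie_{A_n} R_{nj} \right)$$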
Query processing
Original query: SUM(X) over FACT, dims GROUP BY dims
Each node: SUM(X) over 1/n FACT, Ds GROUP BY dims
Merge: SUM(SUMs)
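The rewrite on this slide (per-node SUM followed by a SUM of SUMs) can be checked on toy data. The sketch below is an illustration under assumptions: the helper names, the `region` grouping attribute, and the round-robin split are invented for the example.

```python
from collections import defaultdict

def partial_sum(fact_fragment, dim_rows, join_key, group_attr, measure):
    """Per-node step: join a fact fragment with a replicated dimension
    and compute SUM(measure) GROUP BY group_attr."""
    group_of = {d[join_key]: d[group_attr] for d in dim_rows}
    sums = defaultdict(int)
    for row in fact_fragment:
        sums[group_of[row[join_key]]] += row[measure]
    return dict(sums)

def merge_sums(partials):
    """Submitter step: SUM the per-node SUMs for each group."""
    merged = defaultdict(int)
    for partial in partials:
        for group, value in partial.items():
            merged[group] += value
    return dict(merged)

# Hypothetical data: 10 fact rows dealt round-robin to 3 nodes.
dim = [{"d_key": 1, "region": "north"}, {"d_key": 2, "region": "south"}]
fact = [{"d_key": 1 + i % 2, "x": i} for i in range(10)]
fragments = [fact[i::3] for i in range(3)]

merged = merge_sums(partial_sum(f, dim, "d_key", "region", "x") for f in fragments)
single_node = partial_sum(fact, dim, "d_key", "region", "x")
print(merged == single_node)  # the merged result matches the unpartitioned query
```

The equality holds because SUM is distributive over a disjoint split of the fact rows; aggregates such as AVG would need to be rewritten as SUM and COUNT before merging.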
Query Processing Steps
[Figure: steps exchanged between the Submitter Node and the Computing Nodes]
1. Rewrite Query
2. Send Query
3. Compute Partial Result
4. Repartition
5. Send Partial Results
6. Apply Merge Query
7. Redistribute
Problem (TPC-H case study)
• Many typical Schemas are “Complex” – many large relations may exist
[Figure: TPC-H schema with relations PartSupp, Supplier, Customer, Orders, Lineitem, Part; size labels: Very large, Large, Medium; two relations marked “?”]
Problem Statement
• Divide by N … would expect N times faster: Linear Speedup (LS)
• However, we don't get the LS
Our Major Contributions
• Show these problems experimentally
– performance evaluation benchmark TPC-H: We EXPLAIN AND ILLUSTRATE the LARGE RELATIONS problem
• Identify simple modifications to improve results
• Analyze the modifications experimentally
Partitioning Facts (Largest)
• LI + PS Partitioned
[Figure: Node 1 … Node N, each shown with relations P, PS, S, C, O, Li]
• Generated TPC-H 50GB into 1 and 25 nodes
• Used PCs (Pentium III 866 MHz CPU) 512MB RAM
• Oracle 9i, tuned initial setting
• TPC-H 22 query set
• Measured Response Time: 1 node against 25 nodes
• We show that the speedup underachievement is explained mostly by the size of replicated dimensions
Experimental Results
• Only a few queries exhibited near-to-LS!
[Figure: speedup per query, four bar-chart panels]
LS Speedup 25-30: Q6, Q1, Q15
Medium Speedup 6-15: Q19, Q11, Q14
Low Speedup 2-6: Q7, Q5, Q9, Q3, Q16, Q12, Q10
Very Low Speedup 0.4-1.9: Q8, Q22, Q4, Q13, Q21, Q2
Some had Linear Speedup…
[Figure: speedup bars for Q6, Q1, Q15 (LS Speedup 25-30), with one schema diagram (S, C, O, Li, P, PS) per query group]
Q1, Q6:
• Access only fragments (Li/N)
Q15:
• S is reasonably small relative to Li/N
Others had smaller speedup…
[Figure: speedup bars for Q19, Q11, Q14 (Medium Speedup 6-15), with one schema diagram (S, C, O, Li, P, PS) per query group]
Q14, Q19:
• P is not small relative to fragment (Li/N)
Q11:
• S is not small relative to PS/N
What Happened…
• With N nodes we would like to:
– process 1/N of the data, have about N times speedup
• However, we have replicated relations…
– each node works on $\left(\frac{R_1}{N}, R_2, \text{const}\right)$ rather than $\left(\frac{R_1}{N}, \frac{R_2}{N}, \text{const}\right)$
• The amount of speedup degradation depends on the size of R2 relative to R1/N
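To see how strongly the size of R2 relative to R1/N matters, here is a rough back-of-the-envelope sketch. It assumes work is simply proportional to the number of rows touched (single node: R1 + R2; each of N nodes: R1/N + R2). That linear model and the relation sizes below are assumptions for illustration, not measurements from the talk.

```python
def estimated_speedup(r1, r2, n):
    """Speedup estimate under an assumed linear cost model:
    single-node work (r1 + r2) divided by per-node work (r1 / n + r2),
    where r1 is partitioned over n nodes and r2 is replicated."""
    return (r1 + r2) / (r1 / n + r2)

# Hypothetical relation sizes (arbitrary units), N = 25 nodes.
for r2 in (0, 4, 25, 100):
    print(r2, round(estimated_speedup(100, r2, 25), 2))
```

With no replicated relation in the query the estimate is the full factor of 25. Once R2 equals the fragment size (4 units here) the estimate is already down to 13, and for larger R2 it keeps falling toward 1.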
Low Speedup Queries: Speedup 2-5.5
[Figure: speedup bars for Q7, Q5, Q9, Q3, Q16, Q12, Q10 (axis 0 to 6), with one schema diagram (S, C, O, Li, P, PS) per query group]
Q3, Q5, Q7, Q10, Q12:
• O is large relative to Li/N
Q16:
• P is large relative to PS/N
Q9: [schema diagram only]
Very Low or No Speedup Queries: Speedup 0.4-2
[Figure: speedup bars for Q8, Q22, Q4, Q13, Q21, Q2 (axis 0 to 2), with one schema diagram (S, C, O, Li, P, PS) per query group]
Q13, Q22:
• Process only replicated relations
Q8:
• Includes all replicated relations
Q4, Q21, Q2:
• Scenarios similar to “Slow Queries”
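What Happened…
• Not only includes replicated relations…
– each node works on $\left(\frac{R_1}{N}, R_2, \text{const}\right)$ rather than $\left(\frac{R_1}{N}, \frac{R_2}{N}, \text{const}\right)$
• But also replicated relations included are very large in comparison to fragments!
$$R_2 \gg \frac{R_1}{N}$$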
The same in pictures…
[Figure: one schema diagram (S, C, O, Li, P, PS) per case]
• Medium speedup
• Low speedup
– O is large relative to Li/N
• Large speedup
• No speedup at all
– O is large relative to Li/N
Back to Partitioning Alternatives…
• Placement alternatives: relation in Single Node vs Replicated (all nodes) vs Partitioned
• Partitioning function (Round-robin/Random, Range, HASH)
• Choice of Partitioning attributes
[Figure: ProductSupplyHistory (PS), Lineitem (LI), Orders (O), Customer (C), with candidate partitioning keys marked “?”: PS_key, O key, C key]
• Repartitioning = re-hash by exchanging rows between nodes
• When you partition more than 1 relation => will probably need to repartition
• e.g.: If you partition LI and O by O_KEY = “equi-partitioned”
… LI join PS needs repartitioning of LI
… O join C needs repartitioning of O
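The LI/O/PS example can be made concrete with a small sketch. Assumptions: keys are integers, the "hash" is a plain modulo so the output is repeatable, and the helper names and toy rows are invented for illustration.

```python
def node_of(key, n_nodes):
    """Toy hash function: integer key modulo the number of nodes."""
    return key % n_nodes

def hash_partition(rows, attr, n_nodes):
    """Place each row on the node chosen by hashing its partitioning attribute."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        nodes[node_of(row[attr], n_nodes)].append(row)
    return nodes

def join_is_local(left_nodes, right_nodes, attr):
    """True if no left row has a join partner stored on a different node."""
    for i, left in enumerate(left_nodes):
        remote_keys = {r[attr] for j, frag in enumerate(right_nodes) if j != i for r in frag}
        if any(row[attr] in remote_keys for row in left):
            return False
    return True

def rows_to_ship(nodes, new_attr):
    """Rows that change node when the relation is re-hashed on new_attr."""
    n = len(nodes)
    return sum(1 for i, frag in enumerate(nodes) for row in frag
               if node_of(row[new_attr], n) != i)

N = 4
orders = [{"o_key": k} for k in range(8)]
partsupp = [{"ps_key": k} for k in range(5)]
lineitem = [{"o_key": k % 8, "ps_key": (k * 3) % 5} for k in range(16)]

li_nodes = hash_partition(lineitem, "o_key", N)
o_nodes = hash_partition(orders, "o_key", N)
ps_nodes = hash_partition(partsupp, "ps_key", N)

print(join_is_local(li_nodes, o_nodes, "o_key"))    # equi-partitioned on O_KEY
print(join_is_local(li_nodes, ps_nodes, "ps_key"))  # LI is not placed by PS key
print(rows_to_ship(li_nodes, "ps_key"))             # LI rows to exchange before LI join PS
```

After re-hashing LI on `ps_key` the join with PS becomes local again, which is exactly the repartitioning step the slide describes.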
Let's Review Related Work…
• Replicate all but one relation – PRS [Yu et al., TKDE89]
– Similar to what we did: replicated all except LI
[Yu et al., TKDE89]: “Partition strategy for distributed query processing in fast local networks”
• Partition using dependencies – PLACEMENT DEPENDENCY [Liu et al., ICDE96]
– e.g. partition ORDERS and co-locate its LINEITEM rows (LI is the dependent relation)
[Liu et al., ICDE96]: “A Distributed Query Processing Strategy Using Placement Dependency”
[Chen et al., ICPADS 2000]: “An Efficient Algorithm for Distributed Queries Using Partition Dependency”
• Parallel Hash Join and Optimization – PHJ
– Relations are hash-partitioned, repartitioning required to re-hash in order to JOIN
[DeWitt et al., VLDB11]: “Multiprocessor Hash-Based Join Algorithms”
[Liu et al., EDBT96]: “A Hash Partition Strategy for Distributed Query Processing”
[Kitsuregawa et al., 1983]: “Application of hash to database machine and its architecture”
[Shasha et al., TODS91]: “Optimizing Equijoin Queries In Distributed Databases … Hash Partitioned”
• Workload-based Partitioning and Placement
– Determine best partitioning attributes automatically, based on the workload
[Zilio et al., 1994]: “Partitioning Key Selection for a Shared-Nothing Parallel Database System”
[Rao et al., SIGMOD 2000]: “Automating physical database design in a parallel database”
Local Replicated Join:
• Join Fragment to replicated relation locally, no data exchanged
• One Relation must be Replicated
– E.g. LI(O_KEY), O()
$$\text{Cost}_{\text{local replicated join}} = \alpha \times \left(\frac{R_1}{N} + R_2\right)$$
N nodes, relations R, constant α
Local Partitioned Join
• Join fragments locally, no data exchanged
• Relations must be equi-partitioned
– E.g. LI(O_KEY), O(O_KEY)
$$\text{Cost}_{\text{local join}} = \alpha \times \left(\frac{R_1}{N} + \frac{R_2}{N}\right)$$
N nodes, relations R, constant α
Repartition Join
• Re-hash with data exchange, then join locally
• Relation Partitions are not co-located
– E.g. O(O_KEY), C(C_KEY)
$$\text{Cost}_{\text{repartition join}} = \beta \times \left(\frac{R_1}{N} - \frac{R_1}{N^2}\right) + \alpha \times \left(\frac{R_1}{N} + \frac{R_2}{N}\right)$$
α, β: constant weight factors
Depends on network configuration
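The three join costs can be compared numerically with a short sketch. Everything here is an assumption for illustration: the linear forms in the docstrings are one reading of the slide formulas, the weight names `alpha` and `beta` are placeholders, and the relation sizes are made up.

```python
def cost_local_replicated_join(r1, r2, n, alpha=1.0):
    """Assumed form: alpha * (R1/N + R2). R1 is partitioned, R2 is replicated."""
    return alpha * (r1 / n + r2)

def cost_local_partitioned_join(r1, r2, n, alpha=1.0):
    """Assumed form: alpha * (R1/N + R2/N). Both relations are equi-partitioned."""
    return alpha * (r1 / n + r2 / n)

def cost_repartition_join(r1, r2, n, alpha=1.0, beta=1.0):
    """Assumed form: beta * (R1/N - R1/N^2) + alpha * (R1/N + R2/N).
    The beta term models each node shipping the (N-1)/N share of its R1
    fragment that hashes to other nodes; the alpha term is the local join."""
    return beta * (r1 / n - r1 / n**2) + alpha * (r1 / n + r2 / n)

# Hypothetical sizes (arbitrary units) on N = 25 nodes.
r1, r2, n = 1000, 200, 25
for name, cost in [
    ("local replicated", cost_local_replicated_join(r1, r2, n)),
    ("local partitioned", cost_local_partitioned_join(r1, r2, n)),
    ("repartition", cost_repartition_join(r1, r2, n)),
]:
    print(f"{name}: {cost:.1f}")
```

Under these assumed weights, repartitioning a relation costs more than a purely local partitioned join but still far less than joining against a large replicated relation, which is the trade-off the partitioning choice has to balance.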
Proposed Solution
• “Very Small” Dimensions
– Replicate
– “Very small” depends on relation sizes and number of nodes
• Non-small Dimensions
– Hash-Partition by PRIMARY KEY
– because they “always” join based on PK (with facts)
– like in placement-dependency, we take advantage of invariant
• Facts
– Find hash-partitioning attribute that minimizes repartitioning costs
– Reasonable approximation: most frequent equi-join attr.
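The three rules above can be sketched as a small placement function. This is an illustration under assumptions: the "very small" test (a dimension no larger than `small_factor` times the largest relation's per-node fragment), the row counts, and the workload list are all hypothetical choices, not values from the talk.

```python
from collections import Counter

def choose_placement(relations, fact_joins, n_nodes, small_factor=0.75):
    """relations: name -> {"rows": int, "kind": "fact" | "dimension", "pk": str}
    fact_joins: (fact_name, attribute) pairs, one per equi-join seen in the workload.
    Returns name -> ("replicate", None) or ("hash", attribute)."""
    largest = max(rel["rows"] for rel in relations.values())
    limit = small_factor * largest / n_nodes  # hypothetical "very small" threshold
    placement = {}
    for name, rel in relations.items():
        if rel["kind"] == "fact":
            # Approximation from the slide: most frequent equi-join attribute.
            counts = Counter(attr for fact, attr in fact_joins if fact == name)
            attr = counts.most_common(1)[0][0] if counts else rel["pk"]
            placement[name] = ("hash", attr)
        elif rel["rows"] <= limit:
            placement[name] = ("replicate", None)
        else:
            placement[name] = ("hash", rel["pk"])
    return placement

# Hypothetical sizes (thousands of rows) and workload, 25 nodes.
relations = {
    "LI": {"rows": 6000, "kind": "fact", "pk": "li_key"},
    "PS": {"rows": 800, "kind": "fact", "pk": "ps_key"},
    "O": {"rows": 1500, "kind": "dimension", "pk": "o_key"},
    "P": {"rows": 200, "kind": "dimension", "pk": "p_key"},
    "C": {"rows": 150, "kind": "dimension", "pk": "c_key"},
    "S": {"rows": 10, "kind": "dimension", "pk": "s_key"},
}
fact_joins = [("LI", "o_key"), ("LI", "o_key"), ("LI", "p_key"),
              ("PS", "p_key"), ("PS", "p_key"), ("PS", "s_key")]
placement = choose_placement(relations, fact_joins, n_nodes=25)
```

With these made-up inputs, LI and O end up hash-partitioned on `o_key`, PS and P on `p_key`, and the two smallest dimensions are replicated. A real design tool would weigh repartitioning costs per query instead of just counting join attributes.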
Result of Partitioning (TPC-H)
[Figure: TPC-H schema after partitioning. O and Li labelled O_KEY; P and PS labelled P_KEY; S and C unlabelled. Legend: Local Join (equi-partitioned), Replicated Join (with small dimension), Repartitioned Join]
Experimental Results
[Figure: speedup per query, four bar-chart panels: Q6, Q1, Q15 (axis 0 to 30); Q19, Q11, Q14 (axis 0 to 25); Q7, Q5, Q9, Q3, Q16, Q12, Q10 (axis 0 to 30); Q8, Q4, Q13, Q2 (axis 0 to 30). Annotations: “Ship only selected rows from LI …” and “LI join P” (twice)]
Repartition VS Total Runtime
• TC = total runtime
• RC = repartition time
• Repartition time is reasonably small…
• Depends on: number of nodes + selectivities
– (can be very dependent on selection conditions of specific query)
[Figure: two charts for the queries requiring repartitioning (Q8, Q9, Q14, Q19): runtime in secs for TC and RC (axis 0 to 300), and % overhead RC/TC (axis 0% to 40%)]
Conclusions
• We have analyzed a basic partitioning strategy (PRS-like)
– Largest Relation is partitioned, the others are replicated
– The speedup is totally unsatisfactory for many queries
• We analyzed why this happens: explained by access patterns to replicated relations
• We tried a very simple partitioning alternative
– Only very small relations are replicated
– Dimensions are partitioned by Primary Key
– Hash-partition facts, partitioning key = most frequent join attr
• We have shown that it works well
– prevents very low speedup
– provides near to linear speedup for most queries
•Thank You!
•Questions?
• www.eden.dei.uc.pt/~pnf