CMU SCS 15-721 (Spring 2017) :: Parallel Join Algorithms
Andy Pavlo // Carnegie Mellon University // Spring 2017
ADVANCED DATABASE SYSTEMS
Lecture #18 – Parallel Join Algorithms (Hashing)
15-721
@Andy_Pavlo // Carnegie Mellon University // Spring 2017
CMU 15-721 (Spring 2017)
TODAY’S AGENDA
Background
Parallel Hash Join
Hash Functions
Hash Table Implementations
Evaluation
PARALLEL JOIN ALGORITHMS
Perform a join between two relations on multiple threads simultaneously to speed up the operation.
Two main approaches:
→ Hash Join
→ Sort-Merge Join
We won’t discuss nested-loop joins…
OBSERVATION
Many OLTP DBMSs don’t implement hash join.
But an index nested-loop join with a small number of target tuples is more or less equivalent to a hash join.
HASHING VS. SORTING
1970s – Sorting
1980s – Hashing
1990s – Equivalent
2000s – Hashing
2010s – ???
PARALLEL JOIN ALGORITHMS
SORT VS. HASH REVISITED: FAST JOIN IMPLEMENTATION ON MODERN MULTI-CORE CPUS (VLDB 2009)
→ Hashing is faster than Sort-Merge.
→ Sort-Merge will be faster with wider SIMD.
MASSIVELY PARALLEL SORT-MERGE JOINS IN MAIN MEMORY MULTI-CORE DATABASE SYSTEMS (VLDB 2012)
→ Sort-Merge is already faster, even without SIMD.
MAIN-MEMORY HASH JOINS ON MULTI-CORE CPUS: TUNING TO THE UNDERLYING HARDWARE (ICDE 2013)
→ New optimizations and results for Radix Hash Join.
Source: Cagri Balkesen
JOIN ALGORITHM DESIGN GOALS
Goal #1: Minimize Synchronization
→ Avoid taking latches during execution.
Goal #2: Minimize CPU Cache Misses
→ Ensure that data is always local to the worker thread.
IMPROVING CACHE BEHAVIOR
Factors that affect cache misses in a DBMS:
→ Cache + TLB capacity.
→ Locality (temporal and spatial).
Non-Random Access (Scan):
→ Clustering to a cache line.
→ Execute more operations per cache line.
Random Access (Lookups):
→ Partition data to fit in cache + TLB.
Source: Johannes Gehrke
PARALLEL HASH JOINS
Hash join is the most important operator in a DBMS for OLAP workloads.
It’s important that we speed it up by taking advantage of multiple cores.
→ We want to keep all of the cores busy, without becoming memory bound.
DESIGN AND EVALUATION OF MAIN MEMORY HASH JOIN ALGORITHMS FOR MULTI-CORE CPUS SIGMOD 2011
CLOUDERA IMPALA
% of Total CPU Time Spent in Query Operators (Workload: TPC-H Benchmark)
→ HASH JOIN: 49.6%
→ SEQ SCAN: 25.0%
→ UNION: 3.1%
→ AGGREGATE: 19.9%
→ OTHER: 2.4%
HASH JOIN (R⨝S)
Phase #1: Partition (optional)
→ Divide the tuples of R and S into sets using a hash on the join key.
Phase #2: Build
→ Scan relation R and create a hash table on join key.
Phase #3: Probe
→ For each tuple in S, look up its join key in hash table for R. If a match is found, output combined tuple.
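The build and probe phases can be sketched in a few lines of single-threaded Python. This is only an illustration under simple assumptions: tuples are dicts, the names `hash_join`, `r_key`, and `s_key` are made up for this example, and Python's built-in dict stands in for the hash table.

```python
from collections import defaultdict

def hash_join(r, s, r_key, s_key):
    """Join two lists of dict-tuples on r[r_key] == s[s_key] (illustrative helper)."""
    # Build phase: scan R and index every tuple by its join key.
    table = defaultdict(list)
    for r_tuple in r:
        table[r_tuple[r_key]].append(r_tuple)

    # Probe phase: look up each S tuple's join key in the table built on R
    # and emit one combined tuple per match.
    output = []
    for s_tuple in s:
        for r_tuple in table.get(s_tuple[s_key], []):
            output.append({**r_tuple, **s_tuple})
    return output

R = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
S = [{"rid": 1, "val": 10}, {"rid": 1, "val": 20}, {"rid": 3, "val": 30}]
result = hash_join(R, S, "id", "rid")
```

Only the two S tuples with `rid` 1 find a match, so `result` holds two combined tuples.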
PARTITION PHASE
Split the input relations into partitioned buffers by hashing the tuples’ join key(s).
→ The hash function used for this phase should be different than the one used in the build phase.
→ Ideally the cost of partitioning is less than the cost of cache misses during the build phase.
Contents of buffers depend on storage model:
→ NSM: Either the entire tuple or a subset of attributes.
→ DSM: Only the columns needed for the join + offset.
PARTITION PHASE
Approach #1: Non-Blocking Partitioning
→ Only scan the input relation once.
→ Produce output incrementally.
Approach #2: Blocking Partitioning (Radix)
→ Scan the input relation multiple times.
→ Only materialize results all at once.
NON-BLOCKING PARTITIONING
Scan the input relation only once and generate the output on-the-fly.
Approach #1: Shared Partitions
→ Single global set of partitions that all threads update.
→ Have to use a latch to synchronize threads.
Approach #2: Private Partitions
→ Each thread has its own set of partitions.
→ Have to consolidate them after all threads finish.
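A rough sketch of the private-partitions approach: each worker partitions its own chunk of the input without taking a latch, and the per-thread partitions are consolidated once every worker has finished. The constants, the modulo-based `hash_p`, and the helper names are assumptions made for this example. Python threads also do not run this loop in parallel because of the interpreter lock, so the sketch shows the structure rather than the speedup.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4  # illustrative values, not from the lecture
NUM_THREADS = 2

def hash_p(key):
    # Stand-in partition hash; a real system would use a proper hash function.
    return key % NUM_PARTITIONS

def partition_chunk(chunk):
    # Each worker writes only to its own private partitions, so no latch is needed.
    private = [[] for _ in range(NUM_PARTITIONS)]
    for key in chunk:
        private[hash_p(key)].append(key)
    return private

def private_partitioning(keys):
    chunk_size = max(1, (len(keys) + NUM_THREADS - 1) // NUM_THREADS)
    chunks = [keys[i:i + chunk_size] for i in range(0, len(keys), chunk_size)]
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
        per_thread = list(pool.map(partition_chunk, chunks))
    # Consolidation step: runs only after all workers have finished.
    combined = [[] for _ in range(NUM_PARTITIONS)]
    for private in per_thread:
        for p, bucket in enumerate(private):
            combined[p].extend(bucket)
    return combined

partitions = private_partitioning(list(range(10)))
```

The shared-partitions variant would instead have every worker append to the same `combined` lists while holding a latch per partition.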
SHARED PARTITIONS
[Figure: each tuple of the Data Table (columns A, B, C) is passed through hashP(key) and written to one of the shared Partitions P1, P2, …, Pn.]
PRIVATE PARTITIONS
[Figure: each tuple of the Data Table (columns A, B, C) is passed through hashP(key) and written to per-thread Partitions, which are then Combined into P1, P2, …, Pn.]
RADIX PARTITIONING
Scan the input relation multiple times to generate the partitions.
Multi-step pass over the relation:
→ Step #1: Scan R and compute a histogram of the # of tuples per hash key for the radix at some offset.
→ Step #2: Use this histogram to determine output offsets by computing the prefix sum.
→ Step #3: Scan R again and partition the tuples according to the hash key.
RADIX
The radix is the value of an integer at a particular position (using its base).
Input: 89 12 23 08 41 64
Radix (ones digit): 9 2 3 8 1 4
Radix (tens digit): 8 1 2 0 4 6
PREFIX SUM
The prefix sum of a sequence of numbers (x_0, x_1, …, x_n) is a second sequence of numbers (y_0, y_1, …, y_n) that is a running total of the input sequence.
Input: 1 2 3 4 5 6
Prefix Sum: 1 3 6 10 15 21
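The running total can be reproduced with Python's `itertools.accumulate`. The sketch also includes the exclusive variant (each entry is the sum of everything before it), which is the form one would typically use to turn histogram counts into write offsets. The two helper names are made up for this example, and the `initial=` argument assumes Python 3.8 or newer.

```python
from itertools import accumulate

def inclusive_prefix_sum(xs):
    # y_i = x_0 + x_1 + ... + x_i
    return list(accumulate(xs))

def exclusive_prefix_sum(xs):
    # y_i = x_0 + ... + x_(i-1), so y_0 = 0; handy for output offsets.
    return list(accumulate(xs, initial=0))[:-1]

inclusive = inclusive_prefix_sum([1, 2, 3, 4, 5, 6])
exclusive = exclusive_prefix_sum([1, 2, 3, 4, 5, 6])
```

Here `inclusive` matches the slide (1 3 6 10 15 21), while `exclusive` is the same sequence shifted right by one position with a leading 0.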
RADIX PARTITIONS
Input: 07 18 19 07 03 11 15 10 (hashP(key) maps each key to partition 0 or 1)
Step #1: Inspect input, create histograms
→ CPU 0 (07 18 19 07): Partition 0: 2, Partition 1: 2
→ CPU 1 (03 11 15 10): Partition 0: 1, Partition 1: 3
Step #2: Compute output offsets
→ Output layout: Partition 0, CPU 0 | Partition 0, CPU 1 | Partition 1, CPU 0 | Partition 1, CPU 1
Step #3: Read input and partition
→ Output: 07 07 03 18 19 11 15 10
Recursively repeat until target number of partitions have been created.
Source: Spyros Blanas
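The three steps (histogram, prefix sum, scatter) can be sketched as follows. This is a sequential simulation in which each "worker" is just a loop iteration over its chunk. The function name `radix_partition`, the chunking, and the use of the tens digit as hashP are assumptions chosen so that the sketch can be run on the keys from the slide.

```python
def radix_partition(chunks, num_partitions, hash_p):
    # Step #1: each worker builds a histogram over its own chunk.
    histograms = []
    for chunk in chunks:
        hist = [0] * num_partitions
        for key in chunk:
            hist[hash_p(key)] += 1
        histograms.append(hist)

    # Step #2: an exclusive prefix sum taken in (partition, worker) order
    # gives every worker a private write offset inside every partition.
    offsets = [[0] * num_partitions for _ in chunks]
    running = 0
    for p in range(num_partitions):
        for w in range(len(chunks)):
            offsets[w][p] = running
            running += histograms[w][p]

    # Step #3: scan the input again and scatter each key to its slot.
    # Workers write to disjoint ranges, so no latch would be required.
    output = [None] * running
    for w, chunk in enumerate(chunks):
        cursor = list(offsets[w])
        for key in chunk:
            p = hash_p(key)
            output[cursor[p]] = key
            cursor[p] += 1
    return output, histograms

chunks = [[7, 18, 19, 7], [3, 11, 15, 10]]  # CPU 0 and CPU 1 from the slide
output, histograms = radix_partition(chunks, 2, lambda k: k // 10)
```

With these inputs the histograms come out as 2/2 and 1/3, and the output order is 07 07 03 18 19 11 15 10, which agrees with the slide.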
BUILD PHASE
The threads then scan either the tuples or the partitions of R. For each tuple, hash the join key attribute for that tuple and add it to the appropriate bucket in the hash table.
→ The buckets should only be a few cache lines in size.
→ The hash function must be different than the one that was used in the partition phase.
HASH FUNCTIONS
We don’t want to use a cryptographic hash function for our join algorithm.
We want something that is fast and will have a low collision rate.
HASH FUNCTIONS
MurmurHash (2008)
→ Designed to be a fast, general-purpose hash function.
Google CityHash (2011)
→ Based on ideas from MurmurHash2.
→ Designed to be faster for short keys (<64 bytes).
Google FarmHash (2014)
→ Newer version of CityHash with better collision rates.
CLHash (2016)
→ Fast hashing function based on carry-less multiplication.
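To give a feel for how cheap a non-cryptographic hash can be, here is a sketch of a 64-bit integer mixer written after the `fmix64` finalizer of the public MurmurHash3 reference code. The constants are quoted from memory and should be checked against that reference before relying on them. This is only the final mixing step for a single 64-bit key, not the full MurmurHash3 over byte strings, and `bucket_of` is a made-up helper.

```python
MASK64 = (1 << 64) - 1

def fmix64(k):
    """Mix a 64-bit integer with a few shifts and multiplies (no cryptographic strength)."""
    k &= MASK64
    k ^= k >> 33
    k = (k * 0xFF51AFD7ED558CCD) & MASK64
    k ^= k >> 33
    k = (k * 0xC4CEB9FE1A85EC53) & MASK64
    k ^= k >> 33
    return k

def bucket_of(key, num_buckets):
    # Hypothetical helper: map an integer join key to a hash table bucket.
    return fmix64(key) % num_buckets
```

Every step is invertible (an xor-shift or a multiplication by an odd constant), so distinct 64-bit keys never collide before the final modulo.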
HASH FUNCTION BENCHMARKS
[Figure: two plots of Throughput (MB/sec) vs. Key Size (bytes) for std::hash, MurmurHash3, CityHash, FarmHash, and CLHash, with key sizes 32, 64, 128, and 192 marked.]
Intel Xeon CPU E5-2630v4 @ 2.20GHz
Source: Fredrik Widlund
HASH TABLE IMPLEMENTATIONS
Approach #1: Chained Hash Table
Approach #2: Open-Addressing Hash Table
Approach #3: Cuckoo Hash Table
CHAINED HASH TABLE
Maintain a linked list of “buckets” for each slot in the hash table. Resolve collisions by placing all elements with the same hash key into the same bucket.
→ To determine whether an element is present, hash to its bucket and scan for it.
→ Insertions and deletions are generalizations of lookups.
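A minimal chained hash table might look like the sketch below. It makes two simplifications: a Python list plays the role of each slot's linked list of buckets, and the built-in `hash` stands in for hashB. The class and method names are made up for this example.

```python
class ChainedHashTable:
    def __init__(self, num_slots=8):
        # One chain per slot; a Python list stands in for the linked list.
        self.slots = [[] for _ in range(num_slots)]

    def _chain(self, key):
        return self.slots[hash(key) % len(self.slots)]

    def insert(self, key, value):
        # Colliding keys simply end up in the same chain.
        self._chain(key).append((key, value))

    def lookup(self, key):
        # Hash to the chain, then scan it for matching keys.
        return [v for k, v in self._chain(key) if k == key]

    def delete(self, key):
        chain = self._chain(key)
        chain[:] = [(k, v) for k, v in chain if k != key]

table = ChainedHashTable(num_slots=8)
table.insert(1, "a")
table.insert(9, "b")  # 9 % 8 == 1, so this lands in the same chain as key 1
```

`lookup` returns a list because a join may see several build tuples with the same key.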
[Figure: chained hash table. hashB(key) selects a slot, and each slot points to a chain of buckets terminated by Ø.]
OPEN-ADDRESSING HASH TABLE
Single giant table of slots. Resolve collisions by linearly searching for the next free slot in the table.
→ To determine whether an element is present, hash to a location in the table and scan for it.
→ Have to store the key in the table to know when to stop scanning.
→ Insertions and deletions are generalizations of lookups.
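Linear probing can be sketched as follows. The class name and the fixed table size are assumptions for this example, and the built-in `hash` stands in for hashB. Deletion is left out on purpose, since it would need tombstones or re-insertion so that later lookups do not stop too early.

```python
class LinearProbingTable:
    def __init__(self, num_slots=8):
        self.slots = [None] * num_slots  # each slot holds (key, value) or None

    def insert(self, key, value):
        n = len(self.slots)
        start = hash(key) % n
        for i in range(n):
            pos = (start + i) % n
            if self.slots[pos] is None:  # first free slot after the home position
                self.slots[pos] = (key, value)
                return pos
        raise RuntimeError("table is full")

    def lookup(self, key):
        n = len(self.slots)
        start = hash(key) % n
        matches = []
        for i in range(n):
            entry = self.slots[(start + i) % n]
            if entry is None:  # an empty slot ends the scan
                break
            if entry[0] == key:  # the stored key tells real matches from collisions
                matches.append(entry[1])
        return matches

t = LinearProbingTable(num_slots=8)
positions = [t.insert(1, "x"), t.insert(9, "y"), t.insert(2, "z")]
```

Keys 1 and 9 share home slot 1, so 9 spills into slot 2 and pushes key 2 on to slot 3.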
[Figure: open-addressing hash table. Keys X, Y, and Z are inserted one at a time using hashB(key), and each occupied slot is labeled with its key and hash (X, hashB(X); Y, hashB(Y); Z, hashB(Z)).]
OBSERVATION
To reduce the # of wasteful comparisons during the join, it is important to avoid collisions of hashed keys.
This requires a chained hash table with ~2x the number of slots as the # of elements in R.
CUCKOO HASH TABLE
Use multiple hash tables with different hash functions.
→ On insert, check every table and pick any one that has a free slot.
→ If no table has a free slot, evict the element from one of them and then re-hash it to find a new location.
Look-ups and deletions are always O(1) because only one location per hash table is checked.
[Figure: cuckoo hashing with Hash Table #1 and Hash Table #2. Insert X: hashB1(X) and hashB2(X) are computed and X is stored in Table #1. Insert Y: Y is stored in Table #2. Insert Z: both candidate slots are taken, so Z evicts Y from Table #2; Y is re-hashed with hashB1(Y) and evicts X from Table #1; X is re-hashed with hashB2(X) and moves to Table #2.]
CUCKOO HASH TABLE
We have to make sure that we don’t get stuck in an infinite loop when moving keys.
If we find a cycle, then we can rebuild all of the hash tables with new hash functions.
→ With two hash functions, we (probably) won’t need to rebuild the table until it is at about 50% full.
→ With three hash functions, we (probably) won’t need to rebuild the table until it is at about 90% full.
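The insert, evict, and rebuild logic can be sketched with two tables. Several things here are assumptions for illustration only: keys are unique (as on the build side of a primary-key join), the two hash functions are simple stand-ins derived from the built-in `hash`, a fixed `max_kicks` bound is used to detect a probable cycle, and "new hash functions" is approximated by doubling the table size and re-inserting everything.

```python
class CuckooHashTable:
    def __init__(self, num_slots=8, max_kicks=16):
        self.num_slots = num_slots
        self.max_kicks = max_kicks
        self.tables = [[None] * num_slots, [None] * num_slots]

    def _pos(self, which, key):
        # Two stand-in hash functions; a real system would use seeded hashes.
        h = hash(key)
        if which == 0:
            return h % self.num_slots
        return (h // self.num_slots) % self.num_slots

    def lookup(self, key):
        # Only one location per table is ever checked.
        for which in (0, 1):
            entry = self.tables[which][self._pos(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        entry = (key, value)
        # Pick any table that has a free slot for this key.
        for which in (0, 1):
            pos = self._pos(which, key)
            if self.tables[which][pos] is None:
                self.tables[which][pos] = entry
                return
        # Otherwise evict: place the entry, then re-hash the victim into the other table.
        which = 0
        for _ in range(self.max_kicks):
            pos = self._pos(which, entry[0])
            entry, self.tables[which][pos] = self.tables[which][pos], entry
            if entry is None:
                return
            which = 1 - which
        self._rebuild(entry)  # probably a cycle

    def _rebuild(self, pending):
        entries = [e for t in self.tables for e in t if e is not None] + [pending]
        self.num_slots *= 2  # stand-in for choosing new hash functions
        self.tables = [[None] * self.num_slots, [None] * self.num_slots]
        for k, v in entries:
            self.insert(k, v)

cuckoo = CuckooHashTable(num_slots=4)
for k in [0, 4, 8, 16, 32]:
    cuckoo.insert(k, str(k))
```

In this run, keys 0, 16, and 32 all map to slot 0 in both tables, so the third of them triggers a rebuild, after which every key is found again.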
PROBE PHASE
For each tuple in S, hash its join key and check whether any tuple in the corresponding bucket of the hash table constructed for R matches.
→ If inputs were partitioned, then assign each thread a unique partition.
→ Otherwise, synchronize their access to the cursor on S.
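For the partitioned case, the sketch below assigns each worker exactly one partition pair, so each worker builds and probes a private hash table and no synchronization is needed during the probe. Tuples are `(key, payload)` pairs, and the function names and the modulo-based partitioning are assumptions for this example. As before, Python threads illustrate the structure rather than real parallel speedup.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def join_partition(r_part, s_part):
    # Private hash table for this partition only.
    table = defaultdict(list)
    for key, payload in r_part:
        table[key].append(payload)
    out = []
    for key, payload in s_part:
        for r_payload in table.get(key, []):
            out.append((key, r_payload, payload))
    return out

def partitioned_join(r, s, num_partitions=4):
    r_parts = [[] for _ in range(num_partitions)]
    s_parts = [[] for _ in range(num_partitions)]
    # Matching keys of R and S always land in the same partition index.
    for t in r:
        r_parts[hash(t[0]) % num_partitions].append(t)
    for t in s:
        s_parts[hash(t[0]) % num_partitions].append(t)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # One task per partition pair; tasks share no mutable state.
        results = list(pool.map(join_partition, r_parts, s_parts))
    return [row for part in results for row in part]

R = [(1, "r1"), (2, "r2"), (5, "r5")]
S = [(1, "s1"), (5, "s5a"), (5, "s5b"), (7, "s7")]
joined = partitioned_join(R, S)
```

Without partitioning, the workers would instead pull ranges of S from a shared cursor, which is the point where they have to synchronize.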
HASH JOIN VARIANTS
No Partitioning + Shared Hash Table
Non-Blocking Partitioning + Shared Buffers
Non-Blocking Partitioning + Private Buffers
Blocking (Radix) Partitioning
HASH JOIN VARIANTS
| | No-P | Shared-P | Private-P | Radix |
|---|---|---|---|---|
| Partitioning | No | Yes | Yes | Yes |
| Input scans | 0 | 1 | 1 | 2 |
| Sync during partitioning | – | Spinlock per tuple | Barrier, once at end | Barrier, 4 * #passes |
| Hash table | Shared | Private | Private | Private |
| Sync during build phase | Yes | No | No | No |
| Sync during probe phase | No | No | No | No |
BENCHMARKS
Primary key – foreign key join
→ Outer Relation (Build): 16M tuples, 16 bytes each
→ Inner Relation (Probe): 256M tuples, 16 bytes each
Uniform and highly skewed (Zipf; s=1.25)
No output materialization
DESIGN AND EVALUATION OF MAIN MEMORY HASH JOIN ALGORITHMS FOR MULTI-CORE CPUS SIGMOD 2011
HASH JOIN – UNIFORM DATA SET
[Figure: Cycles / Output Tuple for each variant, broken down into Partition, Build, and Probe.]
→ No Partitioning: 60.2
→ Shared Partitioning: 67.6
→ Private Partitioning: 76.8
→ Radix: 47.3
Annotations: 24% faster than No Partitioning; 3.3x cache misses; 70x TLB misses.
Intel Xeon CPU X5650 @ 2.66GHz, 6 Cores with 2 Threads Per Core
Source: Spyros Blanas
HASH JOIN – SKEWED DATA SET
[Figure: Cycles / Output Tuple for each variant, broken down into Partition, Build, and Probe.]
→ No Partitioning: 25.2
→ Shared Partitioning: 167.1
→ Private Partitioning: 56.5
→ Radix: 50.7
Intel Xeon CPU X5650 @ 2.66GHz, 6 Cores with 2 Threads Per Core
Source: Spyros Blanas
OBSERVATION
We have ignored a lot of important parameters for all of these algorithms so far.
→ Whether to use partitioning or not?
→ How many partitions to use?
→ How many passes to take in partitioning phase?
In a real DBMS, the optimizer will select what it thinks are good values based on what it knows about the data (and maybe hardware).
RADIX HASH JOIN – UNIFORM DATA SET
[Figure: Cycles / Output Tuple (Partition, Build, Probe) for Radix / 1-Pass and Radix / 2-Pass, varying the # of partitions over 64, 256, 512, 1024, 4096, 8192, 32768, and 131072, with No Partitioning as the baseline. Annotations: +24%, -5%.]
Intel Xeon CPU X5650 @ 2.66GHz
Source: Spyros Blanas
EFFECTS OF HYPER-THREADING
[Figure: Speedup vs. Threads (axes marked 1 to 11) for No Partitioning, Radix, and Ideal on the Uniform Data Set, with the Hyper-Threading region marked.]
→ Non-partitioned join relies on multi-threading for high performance.
→ Radix join has fewer cache & TLB misses but this has marginal benefit.
→ Multi-threading hides cache & TLB miss latency.
Intel Xeon CPU X5650 @ 2.66GHz
Source: Spyros Blanas
PARTING THOUGHTS
On modern CPUs, a simple hash join algorithm that does not partition inputs is competitive.
There are additional vectorized execution optimizations that are possible in hash joins that we didn’t talk about. But these don’t really help…
NEXT CLASS
Parallel Sort-Merge Joins