Vectorized Execution (Part I)
@Andy_Pavlo // 15-721 // Spring 2018
ADVANCED DATABASE SYSTEMS
Lecture #21

CMU 15-721 (Spring 2018)
Background
Hardware
Vectorized Algorithms (Columbia)
VECTORIZATION
The process of converting an algorithm's scalar implementation, which processes a single pair of operands at a time, into a vector implementation that performs one operation on multiple pairs of operands at once.

WHY THIS MATTERS
Say we can parallelize our algorithm over 32 cores.
Each core has 4-wide SIMD registers.
Potential Speed-up: 32x × 4x = 128x

MULTI-CORE CPUS
Use a small number of high-powered cores.
→ Intel Xeon Skylake / Kaby Lake
→ High power consumption and area per core.
Massively superscalar and aggressive out-of-order execution
→ Instructions are issued from a sequential stream.
→ Check for dependencies between instructions.
→ Process multiple instructions per clock cycle.

MANY INTEGRATED CORES (MIC)
Use a larger number of low-powered cores.
→ Intel Xeon Phi
→ Low power consumption and area per core.
→ Expanded SIMD instructions with larger register sizes.
Knights Ferry (Columbia Paper)
→ Non-superscalar and in-order execution
→ Cores = Intel P54C (aka Pentium from the 1990s).
Knights Landing (Since 2016)
→ Superscalar and out-of-order execution.
→ Cores = Silvermont (aka Atom)

SINGLE INSTRUCTION, MULTIPLE DATA
A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously.
All major ISAs have microarchitecture support for SIMD operations.
→ x86: MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX512
→ PowerPC: Altivec
→ ARM: NEON

SIMD EXAMPLE
X + Y = Z

    for (i=0; i<n; i++) {
      Z[i] = X[i] + Y[i];
    }

[x1, x2, …, xn] + [y1, y2, …, yn] = [x1+y1, x2+y2, …, xn+yn]

Example input: X = 8 7 6 5 4 3 2 1 and Y = 1 1 1 1 1 1 1 1.
SISD
→ One addition per instruction; Z = 9 8 7 6 5 4 3 2 is produced one element at a time.
SIMD
→ Four 32-bit elements are packed into each 128-bit SIMD register.
→ First instruction: 8 7 6 5 + 1 1 1 1 = 9 8 7 6.
→ Second instruction: 4 3 2 1 + 1 1 1 1 = 5 4 3 2.

STREAMING SIMD EXTENSIONS (SSE)
SSE is a collection of SIMD instructions that target special 128-bit SIMD registers.
These registers can be packed with four 32-bit scalars, after which an operation can be performed on each of the four elements simultaneously.
First introduced by Intel in 1999.

SIMD INSTRUCTIONS (1)
Data Movement
→ Moving data in and out of vector registers
Arithmetic Operations
→ Apply operation on multiple data items (e.g., 2 doubles, 4 floats, 16 bytes)
→ Example: ADD, SUB, MUL, DIV, SQRT, MAX, MIN
Logical Instructions
→ Logical operations on multiple data items
→ Example: AND, OR, XOR, ANDN, ANDPS, ANDNPS

SIMD INSTRUCTIONS (2)
Comparison Instructions
→ Comparing multiple data items (==,<,<=,>,>=,!=)
Shuffle instructions
→ Move data in between SIMD registers
Miscellaneous
→ Conversion: Transform data between x86 and SIMD registers.
→ Cache Control: Move data directly from SIMD registers to memory (bypassing CPU cache).

INTEL SIMD EXTENSIONS

| Year | Extension | Width    | Integers | Single-P | Double-P |
|------|-----------|----------|----------|----------|----------|
| 1997 | MMX       | 64 bits  | ✔        |          |          |
| 1999 | SSE       | 128 bits | ✔        | ✔(×4)    |          |
| 2001 | SSE2      | 128 bits | ✔        | ✔        | ✔(×2)    |
| 2004 | SSE3      | 128 bits | ✔        | ✔        | ✔        |
| 2006 | SSSE3     | 128 bits | ✔        | ✔        | ✔        |
| 2006 | SSE4.1    | 128 bits | ✔        | ✔        | ✔        |
| 2008 | SSE4.2    | 128 bits | ✔        | ✔        | ✔        |
| 2011 | AVX       | 256 bits | ✔        | ✔(×8)    | ✔(×4)    |
| 2013 | AVX2      | 256 bits | ✔        | ✔        | ✔        |
| 2017 | AVX-512   | 512 bits | ✔        | ✔(×16)   | ✔(×8)    |

Source: James Reinders

WHY NOT GPUS?
Moving data back and forth between DRAM and GPU is slow over PCI-E bus.
There are some newer GPU-enabled DBMSs
→ Examples: MapD, SQream, Kinetica
Emerging co-processors that can share CPU’s memory may change this.
→ Examples: AMD’s APU, Intel’s Knights Landing

https://db.cs.cmu.edu/seminar2018
VECTORIZATION
Choice #1: Automatic Vectorization
Choice #2: Compiler Hints
Choice #3: Explicit Vectorization
The slide's figure ranks the three choices along two opposing axes: Ease of Use and Programmer Control.
Source: James Reinders

AUTOMATIC VECTORIZATION
The compiler can identify when instructions inside of a loop can be rewritten as a vectorized operation.
Works for simple loops only and is rare in database operators.
Requires hardware support for SIMD instructions.

AUTOMATIC VECTORIZATION
This loop is not legal to automatically vectorize.
The code is written such that the addition is described as being done sequentially.
X, Y, and Z might point to the same address! For example, a caller could pass Z = X + 1, so that each store to Z[i] changes the X[i+1] read by the next iteration.

    void add(int *X, int *Y, int *Z) {
      for (int i=0; i<MAX; i++) {
        Z[i] = X[i] + Y[i];
      }
    }

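The aliasing hazard can be made concrete with a small experiment. The sketch below is an illustration, not part of the lecture: `add_scalar`, `add_blocked4`, and `run_overlapped` are hypothetical names, and `add_blocked4` only emulates a 4-wide vector loop in plain C++ by reading four inputs before writing four outputs. When Z overlaps X by one element, the two versions disagree, which is why the compiler may not vectorize the loop on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar semantics: one element at a time, exactly as the loop is written.
void add_scalar(const int* x, const int* y, int* z, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        z[i] = x[i] + y[i];
    }
}

// Emulates a 4-wide vectorized loop: all four inputs are read before
// any of the four outputs is written.
void add_blocked4(const int* x, const int* y, int* z, std::size_t n) {
    assert(n % 4 == 0);  // sketch only: no tail handling
    for (std::size_t i = 0; i < n; i += 4) {
        int tmp[4];
        for (std::size_t lane = 0; lane < 4; ++lane) {
            tmp[lane] = x[i + lane] + y[i + lane];
        }
        for (std::size_t lane = 0; lane < 4; ++lane) {
            z[i + lane] = tmp[lane];
        }
    }
}

using AddFn = void (*)(const int*, const int*, int*, std::size_t);

// Calls fn with Z aliased to X + 1 inside one buffer of n + 1 ones.
std::vector<int> run_overlapped(AddFn fn, std::size_t n) {
    std::vector<int> buf(n + 1, 1);
    std::vector<int> y(n, 1);
    fn(buf.data(), y.data(), buf.data() + 1, n);
    return buf;
}
```

With `n = 4`, the scalar version leaves the buffer as 1 2 3 4 5 because each iteration reads the value the previous one wrote, while the blocked version leaves 1 2 2 2 2.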
COMPILER HINTS
Provide the compiler with additional information about the code to let it know that it is safe to vectorize.
Two approaches:
→ Give explicit information about memory locations.
→ Tell the compiler to ignore vector dependencies.

COMPILER HINTS
The restrict keyword tells the compiler that the arrays are distinct locations in memory. It is standard in C99; C++ has no restrict keyword, but most C++ compilers accept the extension spelling __restrict.

    void add(int *restrict X, int *restrict Y, int *restrict Z) {
      for (int i=0; i<MAX; i++) {
        Z[i] = X[i] + Y[i];
      }
    }

COMPILER HINTS
This pragma tells the compiler to ignore loop dependencies for the vectors.
It’s up to you to make sure that this is correct.

    void add(int *X, int *Y, int *Z) {
      #pragma ivdep
      for (int i=0; i<MAX; i++) {
        Z[i] = X[i] + Y[i];
      }
    }

EXPLICIT VECTORIZATION
Use CPU intrinsics to manually marshal data between SIMD registers and execute vectorized instructions.
Potentially not portable.
EXPLICIT VECTORIZATION
Store the vectors in 128-bit SIMD registers.
Then invoke the intrinsic to add together the vectors and write them to the output location.
Both _mm_store_si128 and dereferencing a __m128i pointer require 16-byte-aligned addresses, and the loop assumes MAX is a multiple of 4.

    #include <emmintrin.h>  // SSE2 intrinsics

    void add(int *X, int *Y, int *Z) {
      __m128i *vecX = (__m128i*)X;
      __m128i *vecY = (__m128i*)Y;
      __m128i *vecZ = (__m128i*)Z;
      for (int i=0; i<MAX/4; i++) {
        // Advance all three pointers so each iteration handles the next four ints.
        _mm_store_si128(vecZ++, _mm_add_epi32(*vecX++, *vecY++));
      }
    }

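A somewhat more defensive variant of the explicit SSE2 loop is sketched below. It is an illustration under stated assumptions, not the lecture's code: the function name `add_i32` and the explicit length parameter `n` are hypothetical, the unaligned load/store intrinsics are used so the arrays need not be 16-byte aligned, and a scalar loop handles the leftover elements. The intrinsic path is compiled only when the compiler defines `__SSE2__`; otherwise the scalar loop does all the work.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Adds n 32-bit integers element-wise: z[i] = x[i] + y[i].
void add_i32(const int* x, const int* y, int* z, std::size_t n) {
    assert(n == 0 || (x != nullptr && y != nullptr && z != nullptr));
    std::size_t i = 0;
#if defined(__SSE2__)
    // Four 32-bit lanes per 128-bit register.
    for (; i + 4 <= n; i += 4) {
        __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
        __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(z + i),
                         _mm_add_epi32(vx, vy));
    }
#endif
    // Scalar tail (or the whole array when SSE2 is unavailable).
    for (; i < n; ++i) {
        z[i] = x[i] + y[i];
    }
}
```

The caller is still responsible for making sure the output does not overlap the inputs in a way that changes the result.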
VECTORIZATION DIRECTION
Approach #1: Horizontal
→ Perform operation on all elements together within a single vector.
→ Example: SIMD Add over 0 1 2 3 yields 6.
Approach #2: Vertical
→ Perform operation in an elementwise manner on elements of each vector.
→ Example: SIMD Add of 0 1 2 3 and 1 1 1 1 yields 1 2 3 4.
Source: Przemysław Karpiński

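The two directions can be contrasted with a scalar emulation. This is a minimal sketch, assuming a 4-lane vector modeled as `std::array<int, 4>`; the names `Vec4`, `horizontal_add`, `vertical_add`, and `slide_example` are hypothetical, and no real SIMD instructions are issued.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<int, 4>;  // stand-in for one 4-lane SIMD register

// Horizontal: combine all lanes of a single vector into one scalar.
int horizontal_add(const Vec4& v) {
    int sum = 0;
    for (int lane : v) {
        sum += lane;
    }
    return sum;
}

// Vertical: combine lane i of one vector with lane i of the other.
Vec4 vertical_add(const Vec4& a, const Vec4& b) {
    Vec4 out{};
    for (std::size_t lane = 0; lane < out.size(); ++lane) {
        out[lane] = a[lane] + b[lane];
    }
    return out;
}

// Reproduces the numbers shown on the slide.
void slide_example() {
    const Vec4 v{0, 1, 2, 3};
    const Vec4 ones{1, 1, 1, 1};
    assert(horizontal_add(v) == 6);
    assert((vertical_add(v, ones) == Vec4{1, 2, 3, 4}));
}
```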
EXPLICIT VECTORIZATION
Linear Access Operators
→ Predicate evaluation
→ Compression
Ad-hoc Vectorization
→ Sorting
→ Merging
Composable Operations
→ Multi-way trees
→ Bucketized hash tables
Source: Orestis Polychroniou

VECTORIZED DBMS ALGORITHMS
Principles for efficient vectorization by using fundamental vector operations to construct more advanced functionality.
→ Favor vertical vectorization by processing different input data per lane.
→ Maximize lane utilization by executing different things per lane subset.
RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES (SIGMOD 2015)

FUNDAMENTAL OPERATIONS
Selective Load
Selective Store
Selective Gather
Selective Scatter
FUNDAMENTAL VECTOR OPERATIONS
Selective Load
→ Vector: A B C D
→ Mask: 0 1 0 1
→ Memory: U V W X Y Z • • •
→ Lanes whose mask bit is 1 are filled with consecutive values from memory, so the vector becomes A U C V.
Selective Store
→ Vector: A B C D
→ Mask: 0 1 0 1
→ Memory: U V W X Y Z • • •
→ Lanes whose mask bit is 1 are written to consecutive memory locations, so memory becomes B D W X Y Z • • •

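The semantics of the two masked operations can be pinned down with a scalar emulation. The sketch below assumes a 4-lane vector of characters; `Vec4`, `Mask4`, `selective_load`, and `selective_store` are hypothetical names chosen for illustration, and real hardware would perform each operation with vector instructions rather than a loop.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<char, 4>;   // stand-in for a 4-lane register
using Mask4 = std::array<bool, 4>;  // one bit per lane

// Fills the masked lanes with consecutive values from mem and leaves the
// other lanes untouched. Returns how many values were consumed.
std::size_t selective_load(Vec4& vec, const Mask4& mask, const char* mem) {
    assert(mem != nullptr);
    std::size_t next = 0;
    for (std::size_t lane = 0; lane < vec.size(); ++lane) {
        if (mask[lane]) {
            vec[lane] = mem[next++];
        }
    }
    return next;
}

// Writes the masked lanes to consecutive locations in mem.
// Returns how many values were written.
std::size_t selective_store(const Vec4& vec, const Mask4& mask, char* mem) {
    assert(mem != nullptr);
    std::size_t next = 0;
    for (std::size_t lane = 0; lane < vec.size(); ++lane) {
        if (mask[lane]) {
            mem[next++] = vec[lane];
        }
    }
    return next;
}
```

Both functions return the number of active lanes, which is the amount by which a caller would advance its read or write cursor.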

FUNDAMENTAL VECTOR OPERATIONS
Selective Gather
→ Index Vector: 2 1 5 3
→ Memory (offsets 0 to 5): U V W X Y Z • • •
→ Each lane of the value vector is loaded from memory at the offset given by the index vector, so the value vector becomes W V Z X.
Selective Scatter
→ Value Vector: A B C D
→ Index Vector: 2 1 5 3
→ Memory (offsets 0 to 5): U V W X Y Z • • •
→ Each lane of the value vector is written to memory at the offset given by the index vector, so memory becomes U B A D Y C • • •

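Gather and scatter can be emulated in the same scalar style. This is a sketch under assumptions: `Vec4`, `Idx4`, `gather`, and `scatter` are hypothetical names, all four lanes are treated as active (the selective forms would additionally take a mask), and the bounds checks exist only to keep the example safe.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<char, 4>;         // value vector
using Idx4 = std::array<std::size_t, 4>;  // index vector

// Gather: lane i receives mem[idx[i]].
Vec4 gather(const Idx4& idx, const char* mem, std::size_t mem_len) {
    Vec4 out{};
    for (std::size_t lane = 0; lane < out.size(); ++lane) {
        assert(idx[lane] < mem_len);
        out[lane] = mem[idx[lane]];
    }
    return out;
}

// Scatter: mem[idx[i]] receives lane i.
void scatter(const Vec4& vals, const Idx4& idx, char* mem, std::size_t mem_len) {
    for (std::size_t lane = 0; lane < vals.size(); ++lane) {
        assert(idx[lane] < mem_len);
        mem[idx[lane]] = vals[lane];
    }
}
```

The lane-by-lane loop also mirrors the performance caveat for these operations: each lane touches a different address, so the accesses cannot all be served at once.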
ISSUES
Gathers and scatters are not really executed in parallel because the L1 cache only allows one or two distinct accesses per cycle.
Gathers are only supported in newer CPUs.
Selective loads and stores are also emulated in Xeon CPUs using vector permutations.
VECTORIZED OPERATORS
Selection Scans
Hash Tables
Partitioning
Paper provides additional info:
→ Joins, Sorting, Bloom filters.
RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES (SIGMOD 2015)

SELECTION SCANS
    SELECT * FROM table
     WHERE key >= $(low) AND key <= $(high)

Scalar (Branching)

    i = 0
    for t in table:
      key = t.key
      if (key≥low) && (key≤high):
        copy(t, output[i])
        i = i + 1

Scalar (Branchless)

    i = 0
    for t in table:
      copy(t, output[i])
      key = t.key
      m = (key≥low ? 1 : 0) & (key≤high ? 1 : 0)
      i = i + m

The branchless version always copies the tuple and advances the output cursor only when the predicate holds, so a non-matching tuple is overwritten by the next one.
Source: Bogdan Raducanu
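The two scalar variants translate almost line for line into C++. The sketch below is illustrative only: it scans a plain vector of integer keys rather than full tuples, the function names are hypothetical, and whether the compiler really emits branch-free machine code for `scan_branchless` depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Branching: copy a key only when the predicate holds.
std::size_t scan_branching(const std::vector<int>& keys, int low, int high,
                           std::vector<int>& out) {
    assert(out.size() >= keys.size());
    std::size_t i = 0;
    for (int key : keys) {
        if (key >= low && key <= high) {
            out[i] = key;
            i = i + 1;
        }
    }
    return i;
}

// Branchless: always copy, then advance the cursor by 0 or 1.
std::size_t scan_branchless(const std::vector<int>& keys, int low, int high,
                            std::vector<int>& out) {
    assert(out.size() >= keys.size());
    std::size_t i = 0;
    for (int key : keys) {
        out[i] = key;  // overwritten later if this key does not match
        // Bitwise & evaluates both comparisons without short-circuiting.
        const std::size_t m = static_cast<std::size_t>(key >= low) &
                              static_cast<std::size_t>(key <= high);
        i = i + m;
    }
    return i;
}
```

Both functions return the number of matches; only the first that many entries of `out` are meaningful, and in the branchless case the entry just past them may hold a stale non-matching key.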

SELECTION SCANS
Vectorized

    i = 0
    for vt in table:
      simdLoad(vt.key, vk)
      vm = (vk≥low ? 1 : 0) & (vk≤high ? 1 : 0)
      simdStore(vt, vm, output[i])
      i = i + |vm≠false|

Example:

    SELECT * FROM table
     WHERE key >= "O" AND key <= "U"

| ID | KEY |
|----|-----|
| 1  | J   |
| 2  | O   |
| 3  | Y   |
| 4  | S   |
| 5  | U   |
| 6  | X   |

→ Key Vector: J O Y S U X
→ SIMD Compare produces Mask: 0 1 0 1 1 0
→ All Offsets: 0 1 2 3 4 5
→ SIMD Store with the mask produces Matched Offsets: 1 3 4
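The vectorized scan can be emulated in portable C++ to check the example by hand. This is a sketch, not the paper's implementation: `kLanes` and `scan_vectorized` are hypothetical names, the lane loops stand in for a SIMD compare and a selective store, and the function records matching offsets rather than copying tuples.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 4;  // assumed vector width for the emulation

// Returns the offsets of all keys with low <= key <= high.
std::vector<std::size_t> scan_vectorized(const std::vector<char>& keys,
                                         char low, char high) {
    assert(low <= high);
    std::vector<std::size_t> out(keys.size());
    std::size_t i = 0;
    for (std::size_t base = 0; base < keys.size(); base += kLanes) {
        // "SIMD compare": one mask bit per lane; lanes past the end stay 0.
        std::array<bool, kLanes> vm{};
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            const std::size_t pos = base + lane;
            if (pos < keys.size()) {
                vm[lane] = (keys[pos] >= low) & (keys[pos] <= high);
            }
        }
        // "Selective store": write the offsets of active lanes contiguously.
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            if (vm[lane]) {
                out[i++] = base + lane;
            }
        }
    }
    out.resize(i);
    return out;
}
```

For the keys J O Y S U X with the range "O" to "U", the function returns the offsets 1, 3, and 4, matching the slide.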

SELECTION SCANS
[Figure: two line charts of throughput (billion tuples / sec) against selectivity (%) at 0, 1, 2, 5, 10, 20, 50, and 100. Series: Scalar (Branching), Scalar (Branchless), Vectorized (Early Mat), Vectorized (Late Mat). Panels: MIC (Xeon Phi 7120P – 61 Cores + 4×HT) and Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT). A second build of the slide marks the memory bandwidth limit on both panels. Data labels that survive extraction, with their series not identifiable: 5.7 5.6 5.3 5.7 4.9 4.3 2.8 1.3 and 1.7 1.7 1.7 1.8 1.6 1.4 1.5 1.2.]
HASH TABLES PROBING
Scalar
Linear Probing Hash Table (KEY | PAYLOAD)
Input Key k1 → hash(key) → Hash Index h1
Starting at h1, compare one stored key at a time: k1 = k9, k1 = k3, k1 = k8, k1 = k1
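The scalar probe can be sketched as an ordinary linear-probing lookup. This is a minimal illustration under stated assumptions: `build`, `probe_scalar`, and the `hash_fn` parameter are hypothetical names, the table is a Python list of `(key, payload)` tuples, and the table is assumed never to fill up completely.

```python
def build(pairs, size, hash_fn=hash):
    """Insert (key, payload) pairs into an open-addressing table with `size` slots."""
    table = [None] * size
    for key, payload in pairs:
        slot = hash_fn(key) % size
        while table[slot] is not None:        # assumes at least one free slot remains
            slot = (slot + 1) % size
        table[slot] = (key, payload)
    return table


def probe_scalar(table, key, hash_fn=hash):
    """Return the payload stored for `key`, or None; one comparison per step."""
    size = len(table)
    slot = hash_fn(key) % size
    for _ in range(size):                     # at most one full lap around the table
        entry = table[slot]
        if entry is None:                     # an empty slot ends the probe chain
            return None
        if entry[0] == key:
            return entry[1]
        slot = (slot + 1) % size
    return None


# A constant hash function forces the collision chain k9, k3, k8, k1 from the slide.
same_slot = lambda k: 2
table = build([(9, "p9"), (3, "p3"), (8, "p8"), (1, "p1")], 8, same_slot)
print(probe_scalar(table, 1, same_slot))  # p1
```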
HASH TABLES PROBING
Vectorized (Horizontal)
Linear Probing Bucketized Hash Table (KEYS | PAYLOAD)
Input Key k1 → hash(key) → Hash Index h1
SIMD Compare: k1 = [k9 k3 k8 k1]
Matched Mask: 0 0 0 1
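Horizontal vectorization compares one input key against every key in a bucket at once. The sketch below emulates that with Python lists; `BUCKET_WIDTH`, `build_bucketized`, and `probe_horizontal` are hypothetical names, the bucket width is assumed to equal the number of vector lanes, and `None` marks an empty lane (so `None` cannot be used as a key).

```python
BUCKET_WIDTH = 4  # assumed: one bucket holds as many keys as the vector has lanes


def build_bucketized(pairs, num_buckets, hash_fn=hash, width=BUCKET_WIDTH):
    """Store keys and payloads in fixed-width buckets; assumes some lane stays free."""
    keys = [[None] * width for _ in range(num_buckets)]
    payloads = [[None] * width for _ in range(num_buckets)]
    for key, payload in pairs:
        b = hash_fn(key) % num_buckets
        while None not in keys[b]:            # bucket full: move to the next bucket
            b = (b + 1) % num_buckets
        lane = keys[b].index(None)
        keys[b][lane], payloads[b][lane] = key, payload
    return keys, payloads


def probe_horizontal(keys, payloads, key, hash_fn=hash):
    """Return (payload or None, last match mask) for a single input key."""
    num_buckets = len(keys)
    b = hash_fn(key) % num_buckets
    mask = [False] * len(keys[0])
    for _ in range(num_buckets):
        mask = [k == key for k in keys[b]]    # SIMD compare: key vs. all lanes of the bucket
        if any(mask):
            return payloads[b][mask.index(True)], mask
        if None in keys[b]:                   # a free lane means the chain ends here
            return None, mask
        b = (b + 1) % num_buckets
    return None, mask


to_bucket_0 = lambda k: 0                     # forces all four keys into one bucket
keys, payloads = build_bucketized(
    [(9, "p9"), (3, "p3"), (8, "p8"), (1, "p1")], 2, to_bucket_0)
print(probe_horizontal(keys, payloads, 1, to_bucket_0))
# ('p1', [False, False, False, True])
```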
HASH TABLES PROBING
Vectorized (Vertical)
Linear Probing Hash Table (KEY | PAYLOAD); stored keys shown: k99, k1, k6, k4, k5, k88
Input Key Vector [k1 k2 k3 k4] → hash(key) → Hash Index Vector [h1 h2 h3 h4]
SIMD Gather: fetch the stored keys at those indexes → [k1 k99 k88 k4]
SIMD Compare: [k1 k2 k3 k4] = [k1 k99 k88 k4] → mask [1 0 0 1]
Next iteration: matched lanes load new input keys and unmatched lanes advance to the next slot.
Input Key Vector [k5 k2 k3 k6], Hash Index Vector [h5 h2+1 h3+1 h6]
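Vertical vectorization probes several input keys at once, one per lane, and refills a lane as soon as its key is resolved. The following is a hedged, lane-by-lane emulation in plain Python: `insert`, `probe_vertical`, and the per-lane state lists are hypothetical names, and the identity hash is an assumption made so the example is easy to trace.

```python
def insert(table, key, payload, hash_fn=hash):
    """Linear-probing insert; assumes the table has a free slot."""
    slot = hash_fn(key) % len(table)
    while table[slot] is not None:
        slot = (slot + 1) % len(table)
    table[slot] = (key, payload)


def probe_vertical(table, probe_keys, hash_fn=hash, lanes=4):
    """Probe many keys, one per lane; returns {key: payload} for the keys found."""
    size, results, next_input = len(table), {}, 0
    lane_key = [None] * lanes      # input key currently held by each lane
    lane_slot = [0] * lanes        # table slot each lane inspects next
    lane_steps = [0] * lanes       # slots already inspected for the current key
    active = [False] * lanes
    while True:
        for l in range(lanes):     # refill idle lanes with fresh input keys
            if not active[l] and next_input < len(probe_keys):
                lane_key[l] = probe_keys[next_input]
                lane_slot[l] = hash_fn(lane_key[l]) % size
                lane_steps[l], active[l] = 0, True
                next_input += 1
        if not any(active):
            return results
        # SIMD gather: every active lane reads its own slot in one step.
        gathered = [table[lane_slot[l]] if active[l] else None for l in range(lanes)]
        for l in range(lanes):     # SIMD compare, then per-lane update
            if not active[l]:
                continue
            entry = gathered[l]
            if entry is not None and entry[0] == lane_key[l]:
                results[lane_key[l]] = entry[1]          # match: lane becomes free
                active[l] = False
            elif entry is None or lane_steps[l] + 1 == size:
                active[l] = False                        # miss: lane becomes free
            else:
                lane_slot[l] = (lane_slot[l] + 1) % size  # advance, like h2+1
                lane_steps[l] += 1


table = [None] * 8
for k in (10, 2, 1, 4, 5, 6):      # 10 and 2 collide, so 2 lands one slot later
    insert(table, k, k * 10, hash_fn=lambda x: x)
print(probe_vertical(table, [1, 2, 3, 4, 5, 6], hash_fn=lambda x: x))
```

In the first iteration of this example lanes 0 and 3 match while lanes 1 and 2 do not, which reproduces the 1 0 0 1 mask on the slide; lanes 0 and 3 then pick up keys 5 and 6 while the other two advance by one slot.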
HASH TABLES PROBING
[Figure: throughput (billion tuples / sec) vs. hash table size, comparing Scalar, Vectorized (Horizontal), and Vectorized (Vertical). Panels: MIC (Xeon Phi 7120P – 61 Cores + 4×HT) and Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT). Data labels as extracted: 2.3 2.2 2.1 2.4 1.1 0.9 0.7 0.6; 1.1 1.1 0.9 1.2 0.8 0.8 0.3 0.2.]
[Figure: the same hash-probing plot, with an "Out of Cache" region marked on each panel.]
PARTITIONING HISTOGRAM
Use scatters and gathers to increment counts.
Replicate the histogram to handle collisions.
Input Key Vector [k1 k2 k3 k4] → SIMD Radix → Hash Index Vector [h1 h2 h3 h4]
Single histogram: SIMD Add records only +1, +1, +1 for four keys, because two lanes target the same bucket and one increment is lost.
Replicated histogram: SIMD Scatter writes +1, +1, +1, +1 into one histogram copy per vector lane (# of Vector Lanes), so no two lanes update the same counter.
SIMD Add then merges the copies into the final histogram: +1, +2, +1.
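The lost update and its fix can be reproduced with a small emulation. This is a sketch under stated assumptions: `radix`, `histogram_single`, and `histogram_replicated` are hypothetical names, the radix function simply takes the low bits of an integer key, and a duplicate index in the emulated scatter keeps only one of the writes.

```python
def radix(key, bits):
    """Assumed radix function: the low `bits` bits of an integer key."""
    return key & ((1 << bits) - 1)


def histogram_single(keys, bits, lanes=4):
    """Gather, add, scatter on ONE histogram; duplicate indexes lose increments."""
    hist = [0] * (1 << bits)
    for base in range(0, len(keys), lanes):
        vh = [radix(k, bits) for k in keys[base:base + lanes]]   # SIMD radix
        gathered = [hist[h] for h in vh]                         # SIMD gather
        updated = [count + 1 for count in gathered]              # SIMD add
        for h, count in zip(vh, updated):                        # SIMD scatter
            hist[h] = count          # lanes with the same index overwrite each other
    return hist


def histogram_replicated(keys, bits, lanes=4):
    """One private histogram per lane, merged at the end."""
    replicas = [[0] * (1 << bits) for _ in range(lanes)]
    for base in range(0, len(keys), lanes):
        vh = [radix(k, bits) for k in keys[base:base + lanes]]   # SIMD radix
        for lane, h in enumerate(vh):
            replicas[lane][h] += 1   # lane-private counter, so no conflict
    # Final SIMD add: sum the per-lane copies bucket by bucket.
    return [sum(rep[b] for rep in replicas) for b in range(1 << bits)]


keys = [0, 1, 1, 2]                  # two keys fall into bucket 1
print(histogram_single(keys, bits=2))      # [1, 1, 1, 0]
print(histogram_replicated(keys, bits=2))  # [1, 2, 1, 0]
```

The cost of replication is memory: the histogram grows by a factor equal to the number of lanes, which matters once it no longer fits in cache.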
JOINS
No Partitioning
→ Build one shared hash table using atomics
→ Partially vectorized
Min Partitioning
→ Partition building table
→ Build one hash table per thread
→ Fully vectorized
Max Partitioning
→ Partition both tables repeatedly
→ Build and probe cache-resident hash tables
→ Fully vectorized
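As a rough illustration of the partitioned strategies, the sketch below partitions both inputs on the low bits of the key and then builds and probes one small table per partition. It is a simplification, not the lecture's implementation: it does a single partitioning pass rather than repeated passes, it is scalar and single-threaded, a Python `dict` stands in for the cache-resident hash table, and `partition` and `partitioned_hash_join` are hypothetical names.

```python
def partition(tuples, bits):
    """Split (key, payload) tuples into 2**bits partitions by the key's low bits."""
    parts = [[] for _ in range(1 << bits)]
    mask = (1 << bits) - 1
    for key, payload in tuples:
        parts[key & mask].append((key, payload))
    return parts


def partitioned_hash_join(build_side, probe_side, bits=2):
    """Join two lists of (key, payload) tuples; returns (key, build, probe) triples."""
    out = []
    pairs = zip(partition(build_side, bits), partition(probe_side, bits))
    for build_part, probe_part in pairs:
        table = {}                                   # small per-partition table
        for key, payload in build_part:              # build phase
            table.setdefault(key, []).append(payload)
        for key, payload in probe_part:              # probe phase
            for build_payload in table.get(key, []):
                out.append((key, build_payload, payload))
    return out


r = [(1, "a"), (2, "b"), (5, "c")]
s = [(2, "x"), (5, "y"), (5, "z"), (7, "w")]
print(sorted(partitioned_hash_join(r, s)))
# [(2, 'b', 'x'), (5, 'c', 'y'), (5, 'c', 'z')]
```

Because matching keys always land in the same partition, each partition pair can be joined independently, which is what lets each thread work on its own small table.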
[Figure: join time (sec), broken down into Partition, Build, Probe, and Build+Probe, for Scalar vs. Vector implementations of No Partitioning, Min Partitioning, and Max Partitioning. 200M ⨝ 200M tuples (32-bit keys & payloads), Xeon Phi 7120P – 61 Cores + 4×HT.]
PARTING THOUGHTS
Vectorization is essential for OLAP queries.
These algorithms lose most of their benefit when the data exceeds your CPU cache.
We can combine all the intra-query parallelism optimizations we’ve talked about in a DBMS.
→ Multiple threads processing the same query.
→ Each thread can execute a compiled plan.
→ The compiled plan can invoke vectorized operations.
NEXT CLASS
Vectorization (Part II)