yaSpMV: Yet Another SpMV Framework on GPUs
Shengen Yan, Chao Li, Yunquan Zhang, Huiyang Zhou
Introduction
• Sparse Matrix-Vector Multiplication (SpMV)
  – SpMV is a very important linear algebra algorithm
  – The serial implementation is quite simple:

    // A*x = y, where A is stored in the CSR format.
    for (i = 0; i < m; ++i) {
        double y0 = y[i];
        for (k = rowptr[i]; k < rowptr[i+1]; ++k)
            y0 += value[k] * x[column_index[k]];
        y[i] = y0;
    }

  – Much work has gone into optimizing it on both CPUs and GPUs
    • Many formats have been proposed.
Introduction
• Parallel implementation: two challenges
  – Bandwidth
    • The upper bound of the flop:byte ratio is 0.25
  – Load imbalance
    • Different numbers of non-zeros in different rows
    • Worse on GPUs

A = [ 0 0 3 0 0 0 6 9
      0 0 5 1 0 0 4 0
      0 0 0 0 7 2 3 5
      4 7 0 0 1 3 8 4 ],   x = (dense input vector)
Executive Summary
• BCCOO format – addressing the bandwidth challenge
• Customized efficient segmented scan/sum – addressing the load imbalance problem; very efficient
• Results (GTX 680)
  – vs. CUSPARSE V5.0: up to 229% improvement, 65% on average
  – vs. clSpMV: up to 195% improvement, 70% on average
Outline
• Introduction
• Formats for SpMV – addressing the bandwidth challenge
• Efficient Segmented Sum/Scan for SpMV
• Auto-Tuning Framework
• Experimentation
• Conclusions
COO format

A = [ 0 0 3 0 0 0 6 9
      0 0 5 1 0 0 4 0
      0 0 0 0 7 2 3 5
      4 7 0 0 1 3 8 4 ]

COO format of matrix A:
  values         = [3 6 9 5 1 4 7 2 3 5 4 7 1 3 8 4]
  row indices    = [0 0 0 1 1 1 2 2 2 2 3 3 3 3 3 3]
  column indices = [2 6 7 2 3 6 4 5 6 7 0 1 4 5 6 7]
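As an illustration, the COO triplets above can be consumed by a minimal serial SpMV loop; the function and array names below are ours, not from the paper:

```c
#include <assert.h>

/* Serial SpMV over COO triplets: y[row_idx[k]] += val[k] * x[col_idx[k]].
   y must be zero-initialized by the caller. */
static void spmv_coo(int nnz, const double *val, const int *row_idx,
                     const int *col_idx, const double *x, double *y)
{
    for (int k = 0; k < nnz; ++k)
        y[row_idx[k]] += val[k] * x[col_idx[k]];
}
```

With the arrays of matrix A above and x filled with ones, each output element is simply the sum of the non-zeros in that row.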
Blocked COO (BCOO) format

A = [ 0 0 3 0 0 0 6 9
      0 0 5 1 0 0 4 0
      0 0 0 0 7 2 3 5
      4 7 0 0 1 3 8 4 ]

BCOO format, block size 2x2:
  value blocks         = [3 0; 5 1], [6 9; 4 0], [0 0; 4 7], [7 2; 1 3], [3 5; 8 4]
  block row indices    = [0 0 1 1 1]
  block column indices = [1 3 0 2 3]
Blocked compressed COO (BCCOO) format

A = [ 0 0 3 0 0 0 6 9
      0 0 5 1 0 0 4 0
      0 0 0 0 7 2 3 5
      4 7 0 0 1 3 8 4 ]

BCCOO format, block size 2x2:
  value blocks         = [3 0; 5 1], [6 9; 4 0], [0 0; 4 7], [7 2; 1 3], [3 5; 8 4]
  block row indices    = [0 0 1 1 1]
  difference values    = [0 1 0 0 1]
  bit flag (flipped)   = [1 0 1 1 0]
  block column indices = [1 3 0 2 3]

Row index compression ratio: one bit per block instead of a 32-bit integer (1/32)
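The 1/32 compression comes from replacing each 32-bit block row index with a single bit. A sketch of how the flipped-difference bit flags could be derived from the block row indices (our own helper, assuming the last block always terminates its row):

```c
/* flag[i] = 1 if block i+1 stays in the same block row as block i,
   0 if block i is the last block of its row ("flipped" difference). */
static void row_index_to_bit_flag(int nblocks, const int *block_row,
                                  unsigned char *flag)
{
    for (int i = 0; i < nblocks - 1; ++i)
        flag[i] = (block_row[i + 1] == block_row[i]) ? 1 : 0;
    flag[nblocks - 1] = 0;   /* final block always ends a row */
}
```

For the block row indices [0 0 1 1 1] above, this produces exactly the bit flag [1 0 1 1 0].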
Formats for SpMV
• Extensions of the BCCOO format
  – BCCOO+ format
    • Rearranges the non-zero blocks
    • Relieves irregular accesses to the vector
  – Column index compression
    • Applies a difference function to the column indices
Example matrix
• Assume there are 4 threads

B = [ 5 0 3 0 2 0 6 9
      0 0 0 1 0 0 4 0
      0 4 0 8 0 2 0 0
      0 7 6 1 0 3 8 4 ]

BCCOO format of matrix B (block size 1x1), bit flag:
  [1 1 1 1 | 0 1 0 1 | 1 0 1 1 | 1 1 1 0]
Auxiliary Information for SpMV: Result Entry
• Get the location of the first result generated by each thread in the output array, i.e., the row index that each thread's first result belongs to.
• Only need to count the number of zeros in the bit flag arrays of the previous threads.

  bit flag = [1 1 1 1 | 0 1 0 1 | 1 0 1 1 | 1 1 1 0]
              Thread 0  Thread 1  Thread 2  Thread 3
  result entries = [0 0 2 3]
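The zero-counting above can be sketched directly, assuming each thread owns a fixed number of consecutive bit flags (function and array names are ours):

```c
/* entry[t] = number of '0' bit flags owned by threads 0..t-1, i.e. the
   output row index of the first result thread t will produce. */
static void result_entries(int nthreads, int per_thread,
                           const unsigned char *flag, int *entry)
{
    int zeros = 0;
    for (int t = 0; t < nthreads; ++t) {
        entry[t] = zeros;                       /* zeros seen so far */
        for (int i = 0; i < per_thread; ++i)
            zeros += (flag[t * per_thread + i] == 0);
    }
}
```

On matrix B's bit flag this reproduces the result entries [0 0 2 3] shown above.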
Outline
• Introduction
• Formats for SpMV
• Efficient Segmented Sum/Scan for SpMV – addressing the load imbalance problem
• Auto-Tuning Framework
• Experimentation
• Conclusions
Even workload partition
• No workload imbalance

[Figure: non-zero blocks evenly partitioned across workgroups 0–3, and each workgroup's share across threads T0–T3]
Efficient Segmented Sum/Scan for SpMV
• Three logical steps:
  – Read the data and multiply it with the corresponding vector values
  – Perform a segmented sum/scan using the bit flag array from our BCCOO/BCCOO+ format
  – Combine the results and write them back to global memory
• All three steps are implemented in one kernel
Step 1: Read the data and multiply with vector values
• Ex: 4 threads

Problem: B*x = ?  Assume x = [2 9 6 5 4 8 7 3]
(BCCOO format of matrix B, thread boundaries marked with |)

  values     = [ 5  3  2  6 |  9  1  4  4 |  8  2  7  6 |  1  3  8  4]
  gathered x = [ 2  6  4  7 |  3  5  7  9 |  5  8  9  6 |  5  8  7  3]
  products   = [10 18  8 42 | 27  5 28 36 | 40 16 63 36 |  5 24 56 12]
Step 2: Segmented sum/scan
• Three types of rows in our algorithm:
  – All the non-zeros of a row are in the same thread
    • Serial segmented sum/scan within threads
  – A row spans multiple threads
    • + Parallel segmented sum/scan among threads
  – A row spans multiple workgroups
    • + Cross-workgroup synchronization (details in the paper)
Step 2: Segmented sum/scan
• 1) Serial segmented sum/scan in each thread

Problem: B*x = ?  Assume x = [2 9 6 5 4 8 7 3]

  bit flag = [ 1  1  1  1 |  0  1  0  1 |  1  0  1  1 |  1  1  1  0]
  products = [10 18  8 42 | 27  5 28 36 | 40 16 63 36 |  5 24 56 12]

Serial segmented scan (intermediate[-1] = 0):
  intermediate[i] = intermediate[i-1] * BitFlag[i-1] + intermediate[i]

  intermediate = [10 28 36 78 | 27  5 33 36 | 40 56 63 99 |  5 29 85 97]
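The recurrence above can be sketched directly; each thread's first element keeps its value (intermediate[-1] = 0), modeled here by restarting the scan at every thread boundary:

```c
/* In-place per-thread serial segmented scan:
   inter[i] = inter[i-1] * flag[i-1] + inter[i]. */
static void serial_segmented_scan(int n, int per_thread,
                                  const unsigned char *flag, double *inter)
{
    for (int i = 0; i < n; ++i)
        if (i % per_thread != 0)   /* element 0 of each thread starts a new scan */
            inter[i] = inter[i - 1] * flag[i - 1] + inter[i];
}
```

On the products and bit flag above this yields the intermediate array [10 28 36 78 | 27 5 33 36 | 40 56 63 99 | 5 29 85 97].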
Step 2: Segmented sum/scan
• 2) Generate the last partial sum of each thread and perform a parallel segmented scan among threads on these partial sums

  bit flag     = [1 1 1 1 | 0 1 0 1 | 1 0 1 1 | 1 1 1 0]
  partial sums = [78 36 99 0]   (0 for a thread whose last bit flag is 0: its last row has already ended)
  head flags   = [0 1 1 1]      (1 if a '0' exists in the thread's bit flags)

• The parallel segmented scan, segmented by the head flags, gives each thread the sum carried in from its predecessors: [0 78 36 99]
Step 3: Results combination and write-back to global memory

Problem: B*x = ?  Assume x = [2 9 6 5 4 8 7 3]

  intermediate results at '0' bit flags = [ 27 33 56  97]
  carries from preceding threads        = [ 78  0 36  99]   (added only to each thread's first result)
  combined results                      = [105 33 92 196]

Using the result entries [0 0 2 3] (thread 1's second result goes to row 0+1):

  Final result = [105, 33, 92, 196]
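As a consistency check, the three steps can be simulated sequentially on the running example. This is our own CPU sketch, not the paper's GPU kernel; the inter-thread parallel segmented scan is replaced by a simple carry variable propagated thread by thread:

```c
/* Sequential simulation of the three-step segmented-sum SpMV:
   per-thread serial scan, carry propagation between threads, and
   write-back at every '0' bit flag. Assumes n <= 64. */
static void segmented_sum_spmv(int n, int per_thread, const double *prod,
                               const unsigned char *flag, double *y)
{
    double inter[64];
    for (int i = 0; i < n; ++i)
        inter[i] = prod[i];
    for (int i = 0; i < n; ++i)                  /* step 2a: serial scan */
        if (i % per_thread != 0)
            inter[i] = inter[i - 1] * flag[i - 1] + inter[i];

    double carry = 0.0;                          /* sum flowing in from earlier threads */
    int row = 0;
    for (int t = 0; t < n / per_thread; ++t) {
        int first = 1;                           /* carry applies to first result only */
        int last = (t + 1) * per_thread - 1;
        for (int i = t * per_thread; i <= last; ++i)
            if (flag[i] == 0) {                  /* step 3: a row ends here */
                y[row++] = inter[i] + (first ? carry : 0.0);
                first = 0;
            }
        /* step 2b: carry-out is the trailing partial sum (0 if the thread's
           last row already ended), plus the old carry if nothing absorbed it */
        carry = (flag[last] ? inter[last] : 0.0) + (first ? carry : 0.0);
    }
}
```

Fed the products and bit flag of the example, this reproduces the final result [105, 33, 92, 196].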
Auto-Tuning Framework
• To generate the optimal kernel code, we employ auto-tuning to search for the best parameters.

[Table: tunable parameters]

Average auto-tuning time: 13 seconds
Auto-tuning speed: ~1 million non-zeros per second
Experiments
• Experimental methodology
  – We implemented our proposed scheme in OpenCL.
  – We evaluated it on GTX 480 and GTX 680 using 20 real-world matrices.
  – Comparison libraries:
    • CUSPARSE V5.0 (NVIDIA's official SpMV library)
    • CUSP (SC '09)
    • clSpMV (ICS '12)
Used Matrices

Name             Size            Non-zeros (NNZ)  NNZ/Row
Dense            2K x 2K         4M               2000
Protein          36K x 36K       4.3M             119
FEM/Spheres      83K x 83K       6M               72
FEM/Cantilever   62K x 62K       4M               65
Wind Tunnel      218K x 218K     11M              53
FEM/Harbor       47K x 47K       2.3M             59
QCD              49K x 49K       1.9M             39
FEM/Ship         141K x 141K     7.8M             28
Economics        207K x 207K     1.2M             6
Epidemiology     526K x 526K     2.1M             4
FEM/Accelerator  121K x 121K     2.6M             22
Circuit          171K x 171K     0.95M            6
Webbase          1M x 1M         3.1M             3
LP               4K x 1.1M       11.3M            2825
Circuit5M        5.56M x 5.56M   59.5M            11
eu-2005          863K x 863K     19.2M            22
Ga41As41H72      268K x 268K     18.4M            67
in-2004          1.38M x 1.38M   17M              12
mip1             66K x 66K       10.3M            152
Si41Ge41H72      186K x 186K     15M              81
Performance results on Kepler (GTX 680)

[Chart: GFLOPS on the 20 matrices and their harmonic mean, comparing CUSPARSE, CUSP, clSpMV Cocktail, clSpMV best single, and yaSpMV]

Average performance improvement: 65% over CUSPARSE, 70% over clSpMV Cocktail, 88% over clSpMV best single, 150% over CUSP
Performance breakdown on Kepler (GTX 680)

[Chart: GFLOPS on the 20 matrices and their harmonic mean, adding optimizations cumulatively: COO, +BCCOO, +Efficient Segmented Sum/Scan, +Adjacent Synchronization, +Fine-Grain Optimizations]

Average performance improvement (vs. COO format):
  +BCCOO: 66%
  +Efficient Segmented Sum/Scan: 192%
  +Adjacent Synchronization: 212%
  +Fine-Grain Optimizations: 257%
Relative memory footprint of different formats

[Chart: relative memory footprint of COO, ELL, Cocktail, best single, and BCCOO on the 20 matrices and their average]

Average memory footprint consumption: vs. COO: 60%; vs. ELL: 19%; vs. Cocktail: 79%; vs. best single: 69%
Conclusions
• The BCCOO format
  – Addresses the memory bandwidth problem
• The customized matrix-based segmented sum/scan algorithms
  – Address the workload imbalance problem
  – Need to invoke only one kernel
  – Very efficient: incorporate many optimization approaches
• Results (GTX 680)
  – vs. CUSPARSE V5.0: up to 229% improvement, 65% on average
  – vs. clSpMV: up to 195% improvement, 70% on average
• Code is available online: http://code.google.com/p/yaspmv/
Thanks & Questions?
Step 2: Segmented sum/scan
• 3) Accumulating partial sums across workgroups

[Figure: workgroups P0–P3 each generate their partial sums and then proceed to step 3, with adjacent synchronization between neighboring workgroups]
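Adjacent synchronization chains neighboring workgroups: each one spins until its predecessor has published, consumes the incoming partial sum, and publishes its own. The sketch below is our own CPU simulation with pthreads, not the paper's GPU code, and for simplicity propagates a plain running sum rather than a full segmented scan:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <sched.h>

#define NGROUPS 4

static double local_sum[NGROUPS] = {78, 36, 99, 0};  /* example partial sums */
static double carry[NGROUPS + 1];                    /* carry[t]: sum entering group t */
static atomic_int ready[NGROUPS + 1];                /* ready[t]: group t-1 published */

static void *group_worker(void *arg)
{
    int t = (int)(long)arg;
    while (!atomic_load(&ready[t]))   /* adjacent sync: wait for group t-1 */
        sched_yield();
    carry[t + 1] = carry[t] + local_sum[t];
    atomic_store(&ready[t + 1], 1);   /* publish to group t+1 */
    return NULL;
}

/* Runs the chain and returns the carry leaving the last group. */
static double run_chain(void)
{
    pthread_t tid[NGROUPS];
    atomic_store(&ready[0], 1);       /* group 0 has no predecessor */
    for (int t = 0; t < NGROUPS; ++t)
        pthread_create(&tid[t], 0, group_worker, (void *)(long)t);
    for (int t = 0; t < NGROUPS; ++t)
        pthread_join(tid[t], 0);
    return carry[NGROUPS];
}
```

The seq_cst atomic flag makes each group's carry visible to its successor before the successor leaves its spin loop; no global barrier is needed.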
Fine-grained optimizations
• Texture memory for vector reads
• Cut the adjacent synchronization chain as early as possible
• Remove the parallel segmented scan when possible
• If the number of columns is smaller than 65536, a short-type column index array may help decrease memory traffic
Performance results on Fermi (GTX 480)

[Chart: GFLOPS on the 20 matrices and their harmonic mean, comparing CUSPARSE, CUSP, clSpMV Cocktail, clSpMV best single, and yaSpMV]

Average performance improvement: 42% over CUSPARSE, 40% over clSpMV Cocktail, 60% over clSpMV best single, 74% over CUSP
Absolute memory footprint consumption of the COO, BCOO, and BCCOO formats

[Chart: absolute memory footprint in bytes (0 to 800,000,000) of COO, BCOO, and BCCOO on the 20 matrices and their average]