Accurate BLAS Implementations: OzBLAS and BLAS-DOT2

LSPANC 2020 January, January 30, 2020, R-CCS
Daichi Mukunoki (R-CCS)

Research Collaborators: Takeshi Ogita (Tokyo Woman's Christian University), Katsuhisa Ozaki (Shibaura Institute of Technology)

Acknowledgement: This research was partially supported by MEXT as "Exploratory Issue on Post-K computer" (Development of verified numerical computations and super high performance computing environment for extreme researches) and the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 19K20286.
Fast Accurate Multi-Precision Numerical Library

Needs in minimal-precision computing:
1. Fast computation of arbitrary-precision (incl. accurate) results on CPU/GPU
2. Accurate numerical validation (or verification)
[Diagram: the minimal-precision computing stack on a heterogeneous system (CPU, GPU, FPGA) —
• Arithmetic libraries: FP32, FP64, (FP16, FP128); arbitrary precision: MPFR, DD/QD
• Numerical libraries: BLAS, LAPACK, others…, with fast accurate methods/libraries: OzBLAS, ExBLAS, QPEigen, MPLAPACK (MPBLAS), others…
• Numerical validation: stochastic arithmetic (SAM, CADNA)
• Precision tuning: PROMISE, PROMISE with SAM
• CPU acceleration, GPU acceleration; FPGA acceleration via compilers & tools for FPGA (SPGen for arbitrary precision, Nymble, FloPoCo)
Components are marked as either "available" or "in development".]
Accurate BLAS Implementations

Examples of accurate BLAS implementations (our work: BLAS-DOT2 and OzBLAS):

Library                      | Platform | Data Precision  | Computation Precision
-----------------------------|----------|-----------------|----------------------------
BLAS (netlib)                | CPU, GPU | FP32/64         | FP32/64
XBLAS (netlib)               | CPU      | FP32/64         | FP32/64/2xFP64
BLAS-DOT2 (TWCU)             | GPU      | FP64            | 2xFP64
ReproBLAS (Berkeley)         | CPU      | FP64            | Tunable
OzBLAS (TWCU/RIKEN)          | CPU, GPU | FP64            | Tunable (correct rounding)
ExBLAS (Sorbonne U.)         | CPU, GPU | FP64            | Correct rounding
QPBLAS (JAEA/RIKEN)          | CPU, GPU | DD              | DD
MBLAS / MPLAPACK (M. Nakata) | CPU      | FP64/DD/QD/MPFR | FP64/DD/QD/MPFR
…
BLAS-DOT2

BLAS-DOT2 [1]:
• Input/output: FP64; computation: 2x FP64 (same as XBLAS)
• Based on a 2-fold-precision inner-product algorithm, "Dot2" [2]
  - Built on error-free addition and multiplication, similar to double-double
  - Accuracy depends on the condition number
• Theoretical performance overhead compared to cuBLAS FP64 routines:
  - DOT & GEMV: no overhead; GEMM: 11x slower

[1] Daichi Mukunoki, Takeshi Ogita: Performance and Energy Consumption of Accurate and Mixed-precision Linear Algebra Kernels on GPUs, Journal of Computational and Applied Mathematics, Vol. 372, p. 112701, 2020.
[2] T. Ogita, S. M. Rump, S. Oishi: Accurate Sum and Dot Product, SIAM J. Scientific Computing 26 (2005), 1955–1988.
Code: http://www.math.twcu.ac.jp/ogita/post-k/results.html
Accurate Inner-Product: Dot2 [Ogita et al. 2005]

Notation:
• fl (…): computation performed with floating-point arithmetic (in our case FP64)
• fma (…): computation performed with fused multiply-add
• ulp (a): unit in the last place

Accurate inner product "Dot2" [Ogita et al. 2005]: r = x^T y, r ∈ F, x, y ∈ F^n

function r = Dot2 (x, y)
  [p, s] = TwoProd (x_0, y_0)
  for i = 1 : n-1
    [h, r] = TwoProd (x_i, y_i)
    [p, q] = TwoSum (p, h)
    s = fl (s + (q + r))
  end
  r = fl (p + s)
end function
Accurate Inner-Product: Dot2 [Ogita et al. 2005]

Dot2 is built from two error-free transformations.

Error-free transformation for addition [Knuth 1969]: s + e = a + b, s = fl (a + b)

function [s, e] = TwoSum (a, b)
  s = fl (a + b)
  v = fl (s - a)
  e = fl ((a - (s - v)) + (b - v))
end function

Error-free transformation for multiplication [Karp et al. 1997]: p + e = a × b, p = fl (a × b)

function [p, e] = TwoProd (a, b)
  p = fl (a × b)
  e = fma (a × b - p)
end function

Per element, Dot2 takes 11 instructions (an 11x overhead compared with FP64 DOT using FMA).
Implementation

Matrix multiplication (GEMM) on CUDA:
• Each thread computes one element of matrix C using Dot2
• 2D thread-block and 2D grid, exploiting the 2D data locality of the matrices
• Shared-memory blocking
• Reaching nearly the theoretical peak performance is not hard, thanks to the high compute intensity (11x higher than double)

[Figure: blocking scheme — a 16x16 thread-block (16 elements / 16 threads per dimension) computes one block of C from blocks of A (m × k) and B (k × n) staged in shared memory; each thread computes one element within the block.]
Performance on Titan V

[Figure: Throughput (Ops/s) on Titan V (Volta), CUDA 10.0 — three panels comparing cuBLAS and DOT2: DDOT (GOps/s vs. n = 2^14…2^24), DGEMV (GOps/s vs. m=n up to 10240), DGEMM (TOps/s vs. m=n=k up to 8192, showing the 11x gap). Higher is better.]

Ops: number of mathematical computations of the BLAS operation.

Execution-time overhead of Dot2 vs. the cuBLAS FP64 routines:
• DOT & GEMV: almost no overhead, as they are memory-bound
• GEMM: 11x slower, as it is compute-bound
OzBLAS

OzBLAS [3]:
• Input/output: FP64; computation: tunable accuracy (including correct rounding)
• Based on the error-free transformation for dot products / matrix multiplication known as the "Ozaki scheme" [4]
  - Accuracy depends on the range of the absolute values and on the dimension (in the inner-product direction) of the input
• No full-scratch implementation is needed: the kernel computation (the most time-consuming part) can be performed with standard FP64 BLAS
  - High performance and low development cost
• Speed relative to cuBLAS depends on the input
  - At minimum, more than 4x slower (GEMM) and 8x slower (DOT & GEMV)

[3] D. Mukunoki, T. Ogita, K. Ozaki: Reproducible BLAS Routines with Tunable Accuracy Using Ozaki Scheme for Many-core Architectures, Proc. PPAM 2019 (accepted).
[4] K. Ozaki, T. Ogita, S. Oishi, S. M. Rump: Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications, Numer. Algorithms 59(1), 95–118 (2012).
Code: http://www.math.twcu.ac.jp/ogita/post-k/results.html
Error-Free Transformation for Dot/Mat-mul: Ozaki Scheme [Ozaki et al. 2012]

DOT (x^T y, x, y ∈ F^n) with the Ozaki scheme:

(1) The input vectors x, y are split into several vectors:
    x = x^(1) + x^(2) + ··· + x^(p) + ··· + x^(sx)
    y = y^(1) + y^(2) + ··· + y^(q) + ··· + y^(sy),   x^(p), y^(q) ∈ F^n

The splitting is performed so that the following properties hold:
1. (x^(p))^T y^(q) = fl ((x^(p))^T y^(q)) for any p and q, i.e., each partial dot product is computed in FP64 without rounding error
2. For the non-zero entries x^(p)_i and y^(q)_j: |x^(p)_i| ≥ |x^(p+1)_i| and |y^(q)_j| ≥ |y^(q+1)_j|

⇒ The number of split vectors required to achieve correct rounding depends on the range of the absolute values and on the length of the input vectors.

(2) x^T y is transformed into a summation of several dot products:
    x^T y = (x^(1))^T y^(1) + (x^(1))^T y^(2) + (x^(1))^T y^(3) + ··· + (x^(1))^T y^(sy)
          + (x^(2))^T y^(1) + (x^(2))^T y^(2) + (x^(2))^T y^(3) + ··· + (x^(2))^T y^(sy)
          + ···
          + (x^(sx))^T y^(1) + (x^(sx))^T y^(2) + (x^(sx))^T y^(3) + ··· + (x^(sx))^T y^(sy)

Each of these partial dot products can be computed with a plain FP64 operation (DDOT). With a correctly-rounded summation of the partial results (e.g. NearSum [Rump et al. 2005]), the final result achieves correct rounding.
GEMM with the Ozaki Scheme

1. Splitting of the input matrices A and B: A = A^(1) + A^(2) + ··· + A^(sx), B = B^(1) + B^(2) + ··· + B^(sy)
2. Matrix multiplications of the split matrices: C^(p,q) = A^(p) B^(q) — these GEMMs can be computed using standard DGEMMs
3. Summation of the computed split matrices by NearSum: C = Sum of all sx × sy partial products C^(p,q)

[Figure: the three steps — splitting A and B, the sx × sy grid of partial products C^(p,q) computed by DGEMM, and their elementwise summation into C.]
Performance on Titan V

[Figure: Throughput (Ops/s) of the correctly-rounded operation on Titan V (Volta), CUDA 10.1 — three panels comparing cuBLAS, Ozaki (p=0), and Ozaki (p=1): DDOT (GOps/s vs. n = 2^14…2^24), DGEMV (GOps/s vs. m=n up to 10240), DGEMM (TOps/s vs. m=n=k up to 8192). Higher is better.]

Ops: number of mathematical computations of the BLAS operation.
Input's exponent range (min/max): -10/-1 (p=0), -9/+2 (p=1).

The execution-time overhead vs. the cuBLAS FP64 routines depends on the input (the range of the absolute values of the input vectors/matrices): e.g., more than 10x on DOT & GEMV and 4x on GEMM in the above cases, at minimum.
Accuracy (tunable-accuracy version)

Maximum relative error vs. MPFR (2048-bit)† (GEMM, m=n=k=1000, input: (rand-0.5) × exp(φ × randn)):

Input (min/max exp) | φ=1 (-07/+01) | φ=4 (-10/+07) | φ=7 (-17/+13) | φ=10 (-24/+19)
--------------------|---------------|---------------|---------------|---------------
cuBLAS (double)     | 6.14E-10      | 2.95E-11      | 2.22E-10      | 2.27E-10
OzBLAS (d=2, fast)  | 4.75E-05      | 3.32E-02      | 3.92E+01      | 5.55E+06
OzBLAS (d=2)        | 1.98E-06      | 4.61E-03      | 4.11E+01      | 5.52E+06
OzBLAS (d=3, fast)  | 6.28E-12      | 1.80E-08      | 2.08E-04      | 8.99E+00
OzBLAS (d=3)        | 5.07E-13      | 1.17E-09      | 4.42E-05      | 6.82E+00
OzBLAS (d=4, fast)  | 0             | 7.42E-15      | 2.87E-10      | 3.65E-04
OzBLAS (d=4)        | 0             | 8.25E-16      | 7.79E-11      | 2.78E-04
OzBLAS (d=6, fast)  | 0             | 0             | 0             | 4.37E-16
OzBLAS (d=6)        | 0             | 0             | 0             | 0

†The results were compared after rounding to double precision.

• Accuracy depends on (1) the range of the absolute values in the inputs, (2) the dimension (in the dot-product direction), and (3) the number of split matrices (d)
• Note that the Ozaki scheme can be less accurate than cublasDgemm (e.g., the d=2 results above)
Reproducibility

Results with various implementations (DOT, n=10000):

                 | CPU (Xeon E5-2623 v4)  | GPU (Titan V)
-----------------|------------------------|-----------------------
MPFR (2048-bit)  | 0x1.33b6d7b84f15cp+7   | N/A
cuBLAS (double)  | N/A                    | 0x1.33b6d7b84f15bp+7
MKL (double)     | 0x1.33b6d7b84f15ep+7   | N/A
OzBLAS (d=1)     | 0x1.33b4bbaep+7        | 0x1.33b4bbaep+7
OzBLAS (d=2)     | 0x1.33b6d7b83efffp+7   | 0x1.33b6d7b83efffp+7
OzBLAS (d=3)     | 0x1.33b6d7b84f15cp+7   | 0x1.33b6d7b84f15cp+7

OzBLAS:
• The CPU version relies on MKL and the GPU version on cuBLAS
• For d < 3, the results are not correct, but they are reproducible between CPU and GPU
cuOzBLAS (NVIDIA Titan V)

[Figures: (top) execution-time breakdown (%) for DDOT, DGEMV, and DGEMM on Titan V with d=6 — SplitA, SplitB, Comp, Sum, Other — vs. problem size; (middle) overhead relative to cuBLAS for d=2, 3, 4, 6 (each with and without "fast") vs. problem size; (bottom) achieved memory throughput (GB/s) for DDOT and DGEMV and compute throughput (GFlops) for DGEMM, compared with cuBLAS.]
OzBLAS (Intel Xeon W-2123)

[Figures: the same set of plots on an Intel Xeon W-2123 — execution-time breakdown (%) for DDOT, DGEMV, and DGEMM with d=6 (SplitA, SplitB, Comp, Sum, Other); overhead relative to MKL for d=2, 3, 4, 6 (each with and without "fast"); achieved memory throughput (GB/s) for DDOT and DGEMV and compute throughput (GFlops) for DGEMM, compared with MKL.]
Accurate DGEMM using Tensor Cores

[Figure: Throughput (TFlops on FP64 Ops) of correctly-rounded DGEMM-TC (CR) on Tesla V100 (Volta), CUDA 10.1, vs. problem size (m=n=k up to 8192), for phi=0, 1, 2. Tesla V100: 7 TFlops on FP64, 113 TFlops on Tensor Cores.]

Input's exponent range (min/max): -10/-1 (p=0), -9/+2 (p=1), -9/+4 (p=2).

With the Ozaki scheme, an accurate DGEMM can be built upon cublasGemmEx using Tensor Cores [5] (a paper including some extensions will be published soon).

[5] D. Mukunoki, K. Ozaki, T. Ogita: Accurate DGEMM using Tensor Cores, HPC Asia 2020 Poster Session, Jan. 2020. http://sighpc.ipsj.or.jp/HPCAsia2020/hpcasia2020_poster_abstracts/poster_abstract_09.pdf
Conclusion

Summary:
• Two accurate BLAS implementations: OzBLAS and BLAS-DOT2
  1. BLAS-DOT2 [1]: 2-fold precision for inner products with the Dot2 algorithm
  2. OzBLAS [3]: tunable accuracy (incl. correct rounding) with the Ozaki scheme
• Both are interface-compatible with standard BLAS for FP64 data, but the computation is performed with an accurate method
• Code: http://www.math.twcu.ac.jp/ogita/post-k/results.html

Future work:
• Full-set implementation
• For high-precision inputs:
  - BLAS-DOT2 → BLAS-DD (already available as MBLAS and QPBLAS)
  - Ozaki scheme for binary128 and beyond
• Demo of the use in minimal-precision computing