Accurate BLAS Implementations: OzBLAS and BLAS-DOT2

LSPANC 2020 January, January 30, 2020, R-CCS
Daichi Mukunoki (R-CCS)

Research Collaborators: Takeshi Ogita (Tokyo Woman's Christian University), Katsuhisa Ozaki (Shibaura Institute of Technology)

Acknowledgement: This research was partially supported by MEXT as "Exploratory Issue on Post-K computer" (Development of verified numerical computations and super high performance computing environment for extreme researches) and the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 19K20286.
Fast Accurate Multi-Precision Numerical Library

Needs in minimal-precision computing:
1. Fast computation of arbitrary-precision (incl. accurate) results on CPU/GPU
2. Accurate numerical validation (or verification)
[Diagram: the minimal-precision computing stack on a heterogeneous system (CPU, GPU, FPGA) —
• Arithmetic libraries: FP32, FP64, (FP16, FP128); arbitrary precision: MPFR, DD/QD
• Numerical libraries: BLAS, LAPACK, others…, with fast accurate methods/libraries: OzBLAS, ExBLAS, QPEigen, MPLAPACK (MPBLAS), others…
• Numerical validation: stochastic arithmetic (SAM, CADNA)
• Precision tuning: PROMISE, PROMISE with SAM
• CPU acceleration, GPU acceleration; FPGA acceleration via compilers & tools for FPGA (SPGen for arbitrary precision, Nymble, FloPoCo)
Components are marked as either "available" or "in development".]
Accurate BLAS Implementations

Examples of accurate BLAS implementations (our work: BLAS-DOT2 and OzBLAS):

Library                      | Platform | Data Precision  | Computation Precision
-----------------------------|----------|-----------------|----------------------------
BLAS (netlib)                | CPU, GPU | FP32/64         | FP32/64
XBLAS (netlib)               | CPU      | FP32/64         | FP32/64/2xFP64
BLAS-DOT2 (TWCU)             | GPU      | FP64            | 2xFP64
ReproBLAS (Berkeley)         | CPU      | FP64            | Tunable
OzBLAS (TWCU/RIKEN)          | CPU, GPU | FP64            | Tunable (correct rounding)
ExBLAS (Sorbonne U.)         | CPU, GPU | FP64            | Correct rounding
QPBLAS (JAEA/RIKEN)          | CPU, GPU | DD              | DD
MBLAS / MPLAPACK (M. Nakata) | CPU      | FP64/DD/QD/MPFR | FP64/DD/QD/MPFR
…
BLAS-DOT2

BLAS-DOT2 [1]:
• Input/output: FP64; computation: 2x FP64 (same as XBLAS)
• Based on a 2-fold-precision inner-product algorithm, "Dot2" [2]
  - Built on error-free addition and multiplication, similar to double-double
  - Accuracy depends on the condition number
• Theoretical performance overhead compared to cuBLAS FP64 routines:
  - DOT & GEMV: no overhead; GEMM: 11x slower

[1] Daichi Mukunoki, Takeshi Ogita: Performance and Energy Consumption of Accurate and Mixed-precision Linear Algebra Kernels on GPUs, Journal of Computational and Applied Mathematics, Vol. 372, p. 112701, 2020.
[2] T. Ogita, S. M. Rump, S. Oishi: Accurate Sum and Dot Product, SIAM J. Scientific Computing 26 (2005), 1955–1988.
Code: http://www.math.twcu.ac.jp/ogita/post-k/results.html
Accurate Inner-Product: Dot2 [Ogita et al. 2005]

Notation:
• fl (…): computation performed with floating-point arithmetic (in our case FP64)
• fma (…): computation performed with fused multiply-add
• ulp (a): unit in the last place

Accurate inner product "Dot2" [Ogita et al. 2005]: r = x^T y, r ∈ F, x, y ∈ F^n

function r = Dot2 (x, y)
  [p, s] = TwoProd (x_0, y_0)
  for i = 1 : n-1
    [h, r] = TwoProd (x_i, y_i)
    [p, q] = TwoSum (p, h)
    s = fl (s + (q + r))
  end
  r = fl (p + s)
end function
Accurate Inner-Product: Dot2 [Ogita et al. 2005]

Dot2 is built from two error-free transformations.

Error-free transformation for addition [Knuth 1969]: s + e = a + b, s = fl (a + b)

function [s, e] = TwoSum (a, b)
  s = fl (a + b)
  v = fl (s - a)
  e = fl ((a - (s - v)) + (b - v))
end function

Error-free transformation for multiplication [Karp et al. 1997]: p + e = a × b, p = fl (a × b)

function [p, e] = TwoProd (a, b)
  p = fl (a × b)
  e = fma (a × b - p)
end function

Per element, Dot2 takes 11 instructions (an 11x overhead compared with FP64 DOT using FMA).
Implementation

Matrix multiplication (GEMM) on CUDA:
• Each thread computes one element of matrix C using Dot2
• 2D thread-block and 2D grid, exploiting the 2D data locality of the matrices
• Shared-memory blocking
• Reaching nearly the theoretical peak performance is not hard, thanks to the high compute intensity (11x higher than double)

[Figure: blocking scheme — a 16x16 thread-block (16 elements / 16 threads per dimension) computes one block of C from blocks of A (m × k) and B (k × n) staged in shared memory; each thread computes one element within the block.]
Performance on Titan V

[Figure: Throughput (Ops/s) on Titan V (Volta), CUDA 10.0 — three panels comparing cuBLAS and DOT2: DDOT (GOps/s vs. n = 2^14…2^24), DGEMV (GOps/s vs. m=n up to 10240), DGEMM (TOps/s vs. m=n=k up to 8192, showing the 11x gap). Higher is better.]

Ops: number of mathematical computations of the BLAS operation.

Execution-time overhead of Dot2 vs. the cuBLAS FP64 routines:
• DOT & GEMV: almost no overhead, as they are memory-bound
• GEMM: 11x slower, as it is compute-bound
OzBLAS

OzBLAS [3]:
• Input/output: FP64; computation: tunable accuracy (including correct rounding)
• Based on the error-free transformation for dot products / matrix multiplication known as the "Ozaki scheme" [4]
  - Accuracy depends on the range of the absolute values and on the dimension (in the inner-product direction) of the input
• No full-scratch implementation is needed: the kernel computation (the most time-consuming part) can be performed with standard FP64 BLAS
  - High performance and low development cost
• Speed relative to cuBLAS depends on the input
  - At minimum, more than 4x slower (GEMM) and 8x slower (DOT & GEMV)

[3] D. Mukunoki, T. Ogita, K. Ozaki: Reproducible BLAS Routines with Tunable Accuracy Using Ozaki Scheme for Many-core Architectures, Proc. PPAM 2019 (accepted).
[4] K. Ozaki, T. Ogita, S. Oishi, S. M. Rump: Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications, Numer. Algorithms 59(1), 95–118 (2012).
Code: http://www.math.twcu.ac.jp/ogita/post-k/results.html
Error-Free Transformation for Dot/Mat-mul: Ozaki Scheme [Ozaki et al. 2012]

DOT (x^T y, x, y ∈ F^n) with the Ozaki scheme:

(1) The input vectors x, y are split into several vectors:
    x = x^(1) + x^(2) + ··· + x^(p) + ··· + x^(sx)
    y = y^(1) + y^(2) + ··· + y^(q) + ··· + y^(sy),   x^(p), y^(q) ∈ F^n

The splitting is performed so that the following properties hold:
1. (x^(p))^T y^(q) = fl ((x^(p))^T y^(q)) for any p and q, i.e., each partial dot product is computed in FP64 without rounding error
2. For the non-zero entries x^(p)_i and y^(q)_j: |x^(p)_i| ≥ |x^(p+1)_i| and |y^(q)_j| ≥ |y^(q+1)_j|

⇒ The number of split vectors required to achieve correct rounding depends on the range of the absolute values and on the length of the input vectors.

(2) x^T y is transformed into a summation of several dot products:
    x^T y = (x^(1))^T y^(1) + (x^(1))^T y^(2) + (x^(1))^T y^(3) + ··· + (x^(1))^T y^(sy)
          + (x^(2))^T y^(1) + (x^(2))^T y^(2) + (x^(2))^T y^(3) + ··· + (x^(2))^T y^(sy)
          + ···
          + (x^(sx))^T y^(1) + (x^(sx))^T y^(2) + (x^(sx))^T y^(3) + ··· + (x^(sx))^T y^(sy)

Each of these partial dot products can be computed with a plain FP64 operation (DDOT). With a correctly-rounded summation of the partial results (e.g. NearSum [Rump et al. 2005]), the final result achieves correct rounding.
GEMM with the Ozaki Scheme

1. Splitting of the input matrices A and B: A = A^(1) + A^(2) + ··· + A^(sx), B = B^(1) + B^(2) + ··· + B^(sy)
2. Matrix multiplications of the split matrices: C^(p,q) = A^(p) B^(q) — these GEMMs can be computed using standard DGEMMs
3. Summation of the computed split matrices by NearSum: C = Sum of all sx × sy partial products C^(p,q)

[Figure: the three steps — splitting A and B, the sx × sy grid of partial products C^(p,q) computed by DGEMM, and their elementwise summation into C.]
Performance on Titan V

[Figure: Throughput (Ops/s) of the correctly-rounded operation on Titan V (Volta), CUDA 10.1 — three panels comparing cuBLAS, Ozaki (p=0), and Ozaki (p=1): DDOT (GOps/s vs. n = 2^14…2^24), DGEMV (GOps/s vs. m=n up to 10240), DGEMM (TOps/s vs. m=n=k up to 8192). Higher is better.]

Ops: number of mathematical computations of the BLAS operation.
Input's exponent range (min/max): -10/-1 (p=0), -9/+2 (p=1).

The execution-time overhead vs. the cuBLAS FP64 routines depends on the input (the range of the absolute values of the input vectors/matrices): e.g., more than 10x on DOT & GEMV and 4x on GEMM in the above cases, at minimum.
Accuracy (tunable-accuracy version)

Maximum relative error vs. MPFR (2048-bit)† (GEMM, m=n=k=1000, input: (rand-0.5) × exp(φ × randn)):

Input (min/max exp) | φ=1 (-07/+01) | φ=4 (-10/+07) | φ=7 (-17/+13) | φ=10 (-24/+19)
--------------------|---------------|---------------|---------------|---------------
cuBLAS (double)     | 6.14E-10      | 2.95E-11      | 2.22E-10      | 2.27E-10
OzBLAS (d=2, fast)  | 4.75E-05      | 3.32E-02      | 3.92E+01      | 5.55E+06
OzBLAS (d=2)        | 1.98E-06      | 4.61E-03      | 4.11E+01      | 5.52E+06
OzBLAS (d=3, fast)  | 6.28E-12      | 1.80E-08      | 2.08E-04      | 8.99E+00
OzBLAS (d=3)        | 5.07E-13      | 1.17E-09      | 4.42E-05      | 6.82E+00
OzBLAS (d=4, fast)  | 0             | 7.42E-15      | 2.87E-10      | 3.65E-04
OzBLAS (d=4)        | 0             | 8.25E-16      | 7.79E-11      | 2.78E-04
OzBLAS (d=6, fast)  | 0             | 0             | 0             | 4.37E-16
OzBLAS (d=6)        | 0             | 0             | 0             | 0

†The results were compared after rounding to double precision.

• Accuracy depends on (1) the range of the absolute values in the inputs, (2) the dimension (in the dot-product direction), and (3) the number of split matrices (d)
• Note that the Ozaki scheme can be less accurate than cublasDgemm (e.g., the d=2 results above)
Reproducibility

Results with various implementations (DOT, n=10000):

                 | CPU (Xeon E5-2623 v4)  | GPU (Titan V)
-----------------|------------------------|-----------------------
MPFR (2048-bit)  | 0x1.33b6d7b84f15cp+7   | N/A
cuBLAS (double)  | N/A                    | 0x1.33b6d7b84f15bp+7
MKL (double)     | 0x1.33b6d7b84f15ep+7   | N/A
OzBLAS (d=1)     | 0x1.33b4bbaep+7        | 0x1.33b4bbaep+7
OzBLAS (d=2)     | 0x1.33b6d7b83efffp+7   | 0x1.33b6d7b83efffp+7
OzBLAS (d=3)     | 0x1.33b6d7b84f15cp+7   | 0x1.33b6d7b84f15cp+7

OzBLAS:
• The CPU version relies on MKL and the GPU version on cuBLAS
• For d < 3, the results are not correct, but they are reproducible between CPU and GPU
cuOzBLAS (NVIDIA Titan V)

[Figures: (top) execution-time breakdown (%) for DDOT, DGEMV, and DGEMM on Titan V with d=6 — SplitA, SplitB, Comp, Sum, Other — vs. problem size; (middle) overhead relative to cuBLAS for d=2, 3, 4, 6 (each with and without "fast") vs. problem size; (bottom) achieved memory throughput (GB/s) for DDOT and DGEMV and compute throughput (GFlops) for DGEMM, compared with cuBLAS.]
OzBLAS (Intel Xeon W-2123)

[Figures: the same set of plots on an Intel Xeon W-2123 — execution-time breakdown (%) for DDOT, DGEMV, and DGEMM with d=6 (SplitA, SplitB, Comp, Sum, Other); overhead relative to MKL for d=2, 3, 4, 6 (each with and without "fast"); achieved memory throughput (GB/s) for DDOT and DGEMV and compute throughput (GFlops) for DGEMM, compared with MKL.]
Accurate DGEMM using Tensor Cores

[Figure: Throughput (TFlops on FP64 Ops) of correctly-rounded DGEMM-TC (CR) on Tesla V100 (Volta), CUDA 10.1, vs. problem size (m=n=k up to 8192), for phi=0, 1, 2. Tesla V100: 7 TFlops on FP64, 113 TFlops on Tensor Cores.]

Input's exponent range (min/max): -10/-1 (p=0), -9/+2 (p=1), -9/+4 (p=2).

With the Ozaki scheme, an accurate DGEMM can be built upon cublasGemmEx using Tensor Cores [5] (a paper including some extensions will be published soon).

[5] D. Mukunoki, K. Ozaki, T. Ogita: Accurate DGEMM using Tensor Cores, HPC Asia 2020 Poster Session, Jan. 2020. http://sighpc.ipsj.or.jp/HPCAsia2020/hpcasia2020_poster_abstracts/poster_abstract_09.pdf
Conclusion

Summary:
• Two accurate BLAS implementations: OzBLAS and BLAS-DOT2
  1. BLAS-DOT2 [1]: 2-fold precision for inner products with the Dot2 algorithm
  2. OzBLAS [3]: tunable accuracy (incl. correct rounding) with the Ozaki scheme
• Both are interface-compatible with standard BLAS for FP64 data, but the computation is performed with an accurate method
• Code: http://www.math.twcu.ac.jp/ogita/post-k/results.html

Future work:
• Full-set implementation
• For high-precision inputs:
  - BLAS-DOT2 → BLAS-DD (already available as MBLAS and QPBLAS)
  - Ozaki scheme for binary128 and beyond
• Demo of the use in minimal-precision computing