GPU Sparse LU Factorization and Its Application in Circuit Simulation
Ling Ren
Nano-scale Integrated Circuit and System Lab., Department of Electronic Engineering, Tsinghua University

Upload: bryan-obrien

Post on 17-Dec-2015


Abstract

First work on GPU sparse LU factorization
• Algorithm description: elimination graph (EGraph)
• Algorithm analysis: parallelism in the left-looking algorithm
• Algorithm implementation: timing order on GPU

Supplement to OpenCL BLAS
• Current cl_AMDBLAS has Triangular Solve but no LU
• Objective of LU:


Outline

• Background
• Sparse LU factorization
• Dense LU factorization
• Summary


Background
SPICE: the most popular circuit simulator
• Simulating VLSI (~1 billion transistors) takes several days
• Bottleneck: sparse LU factorization
Also a bottleneck in fluid dynamics, structural analysis, economics …


Sparse LU factorization - related works
[SuperLU 1999]
• Sequential, multi-thread, distributed versions
• Incorporates supernodes, efficient for dense blocks
[Pardiso 2002]
• Sequential, multi-thread, distributed, GPU [Christen2007] versions
• Adopts supernodes
But supernodes rarely form in circuit matrices
[KLU 2010]
• Optimized for circuit matrices
• Only sequential, uses the G/P left-looking algorithm [G/P 1988]
• Adopts BTF, without supernodes


Sparse LU factorization – left-looking
• Sequentially process each column
• When processing column k, use all the columns on the left (1, 2, ..., k-1) to update column k
• Update = vector multiply-and-add (MAD)

[Figure: columns a, b and c; labels: read, write, update]
• read + write > arithmetic
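
The left-looking update described above can be sketched in a few lines of Python. This is a minimal illustration only: it uses dense list-of-lists storage and no pivoting (both simplifications are assumptions of this sketch, not the GPU implementation), and the function name `left_looking_lu` is hypothetical.

```python
def left_looking_lu(a):
    """Left-looking LU without pivoting on a dense list-of-lists matrix.

    Returns (L, U) with a unit diagonal in L. Column k is finished by
    applying every earlier column j with U[j][k] != 0 as one vector MAD.
    """
    n = len(a)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        x = [float(a[i][k]) for i in range(n)]   # working copy of column k
        for j in range(k):                       # columns on the left
            if x[j] != 0.0:                      # x[j] is already the final U[j][k]
                for i in range(j + 1, n):        # vector MAD: x -= x[j] * L[:, j]
                    x[i] -= x[j] * L[i][j]
        for i in range(k + 1):
            U[i][k] = x[i]
        for i in range(k + 1, n):
            L[i][k] = x[i] / x[k]                # assumes a nonzero pivot
    return L, U
```

Each pass of the inner loop reads column j of L and reads and writes the working column while performing one multiply-add per entry, which is consistent with the slide's point that memory traffic outweighs arithmetic.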


Algorithm description – EGraph
• Every column is updated with several columns on its left
• Nonzero structure of U determines the dependency

[Figure: (a) upper triangular matrix U with its nonzeros marked; (b) EGraph; label: vector MAD]
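
As a rough illustration of how the nonzero structure of U yields the dependencies, the sketch below reads the column pattern of U in CSC form and lists, for each column k, the columns j on its left with U(j, k) != 0. The function name `build_egraph` and the 0-based indexing are assumptions made for this example.

```python
def build_egraph(col_ptr, row_idx):
    """Return deps, where deps[k] lists the columns j < k with U(j, k) != 0.

    col_ptr and row_idx describe the nonzero pattern of U in CSC form
    (0-based). Each pair (j, k) is one EGraph edge from j to k.
    """
    n = len(col_ptr) - 1
    deps = []
    for k in range(n):
        rows = row_idx[col_ptr[k]:col_ptr[k + 1]]
        # Off-diagonal nonzeros of column k are the columns it depends on.
        deps.append(sorted(j for j in rows if j < k))
    return deps


# Pattern of a 4x4 upper triangular U:
# column 0: {0}, column 1: {1}, column 2: {0, 2}, column 3: {1, 2, 3}
example_deps = build_egraph([0, 1, 2, 4, 7], [0, 1, 0, 2, 1, 2, 3])
```

In this example, column 2 must wait for column 0, and column 3 must wait for columns 1 and 2.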


Algorithm analysis – two kinds of parallelism
• Divide columns into levels: columns in the same level are independent of each other
• Cluster mode: many columns factorized in parallel
• Pipeline mode: overlap columns from different levels
• Pipeline parallelism, along with timing order

[Figure: overlapped factorization in pipeline mode; labels: Column 1, Column 2, Column 3, Column 4, ..., Thread 1, Thread 2]
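
The level partition can be sketched as follows, assuming the dependencies are given as a list `deps` in which `deps[k]` holds the columns that column k depends on (all smaller than k). The names `column_levels` and `group_by_level` are hypothetical, and the sketch only illustrates the grouping, not the GPU scheduling itself.

```python
def column_levels(deps):
    """level[k] = length of the longest dependency path ending at column k."""
    level = []
    for d in deps:
        # A column with no dependencies is a source node at level 0.
        level.append(1 + max((level[j] for j in d), default=-1))
    return level


def group_by_level(level):
    """Group column indices by level; columns in one group are independent."""
    groups = {}
    for k, lv in enumerate(level):
        groups.setdefault(lv, []).append(k)
    return [groups[lv] for lv in sorted(groups)]


levels = column_levels([[], [], [0], [1, 2]])
schedule = group_by_level(levels)
```

Columns in the same inner list of `schedule` could be factorized together in cluster mode, while pipeline mode would start columns of a later list before the earlier list has fully finished.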


Sparse LU factorization - workflow


Sparse LU factorization - preprocessing
Preprocessing: only once on CPU
• MC64 to ensure numerical stability [MC64]
• Approximate Minimum Degree to reduce fill-ins [AMD]
• Pre-factorization (numeric factorization with partial pivoting) to calculate the symbolic structure of L and U
• Sorting the nonzeros of L and U (introduced later)


Sparse LU factorization – on GPU
GPU inputs
• Location and values of nonzeros in A
• Location of nonzeros in L and U
• The Escheduler
GPU outputs
• Values of nonzeros in L and U
CSC (Compressed Sparse Column) format for sparse matrices A, L and U
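
For readers unfamiliar with CSC, the sketch below converts a small dense matrix into the three CSC arrays. The array names `col_ptr`, `row_idx` and `values` and the helper `dense_to_csc` are illustrative choices, not names taken from the slides.

```python
def dense_to_csc(a):
    """Convert a dense list-of-lists matrix to CSC arrays (0-based).

    col_ptr[j]:col_ptr[j + 1] is the slice of row_idx and values that
    holds the nonzeros of column j, stored top to bottom.
    """
    n_rows, n_cols = len(a), len(a[0])
    col_ptr, row_idx, values = [0], [], []
    for j in range(n_cols):
        for i in range(n_rows):
            if a[i][j] != 0:
                row_idx.append(i)
                values.append(a[i][j])
        col_ptr.append(len(row_idx))   # end of column j
    return col_ptr, row_idx, values


col_ptr, row_idx, values = dense_to_csc([[4, 0, 1],
                                         [0, 3, 0],
                                         [2, 0, 5]])
```

Here `row_idx` carries the locations of the nonzeros and `values` carries their values, which is the split between GPU inputs and outputs listed above.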


Sparse LU factorization - avoid deadlock
• In traditional GPU programs, some wavefronts are inactive at the beginning (limited resources etc.). They wait for other active wavefronts to finish and then become active.
• But in sparse LU, we must ensure all wavefronts are active from the beginning. Otherwise an active wavefront could wait indefinitely on a column assigned to a wavefront that has not started yet.


Sparse LU factorization - data formats
Data formats for intermediate results: dense arrays vs. CSC
CSC (Compressed Sparse Column)
• Can be put in local memory
• Indexed accesses inconvenient (binary search)
• Using too much local memory reduces active work-groups, which leads to severe performance loss
Dense arrays are 2.5x faster than CSC format
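
The difference in indexed access can be illustrated with a small sketch: looking up entry (i, j) in CSC requires a binary search within column j, whereas a column scattered into a dense array is indexed directly. The helper names are hypothetical, and the sketch assumes row indices are sorted within each column.

```python
from bisect import bisect_left


def csc_lookup(col_ptr, row_idx, values, i, j):
    """Return entry (i, j) of a CSC matrix via binary search in column j."""
    lo, hi = col_ptr[j], col_ptr[j + 1]
    pos = bisect_left(row_idx, i, lo, hi)
    if pos < hi and row_idx[pos] == i:
        return values[pos]
    return 0.0   # structural zero


def scatter_column(col_ptr, row_idx, values, j, n_rows):
    """Expand column j into a dense array so x[i] is a direct access."""
    x = [0.0] * n_rows
    for p in range(col_ptr[j], col_ptr[j + 1]):
        x[row_idx[p]] = values[p]
    return x
```

The dense array costs one slot per row rather than one per nonzero, which is the memory trade-off behind the choice described above.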


Sparse LU factorization - data locality
• Higher global memory bandwidth if consecutive work-items access consecutive addresses
• Improve data locality: nonzeros of L and U are out of order after preprocessing, so sort them according to row indices
• 1.7x speedup, overheads negligible: performed only once, incorporated into preprocessing
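
A minimal sketch of the sorting step, assuming CSC arrays named `col_ptr`, `row_idx` and `values` (illustrative names): within each column, the nonzeros are reordered by ascending row index, and the column boundaries stay unchanged.

```python
def sort_csc_columns(col_ptr, row_idx, values):
    """Return (row_idx, values) with each column sorted by row index."""
    new_rows, new_vals = [], []
    for j in range(len(col_ptr) - 1):
        lo, hi = col_ptr[j], col_ptr[j + 1]
        # Positions of column j, ordered by the row index they point to.
        order = sorted(range(lo, hi), key=row_idx.__getitem__)
        new_rows.extend(row_idx[p] for p in order)
        new_vals.extend(values[p] for p in order)
    return new_rows, new_vals


sorted_rows, sorted_vals = sort_csc_columns(
    [0, 3, 5], [2, 0, 1, 3, 1], [30.0, 10.0, 20.0, 50.0, 40.0])
```

After sorting, neighbouring entries of a column refer to increasing rows, so neighbouring work-items tend to touch neighbouring addresses.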


Experimental setups
CPU
• 2 Xeon E5405 CPUs (8 cores in total)
• 2x6 MB L2 cache, 16 GB RAM
GPU
• AMD Radeon 5870 GPU
Testing matrices
• University of Florida Sparse Matrix Collection [Davis]
• http://www.cise.ufl.edu/research/sparse/matrices/


Sparse LU factorization - Experimental results
GPU speedups positively related to floating point operations (flops)


Sparse LU factorization - Experimental results
Matrices divided into 4 groups
First three groups according to Mflops
• GPU speedup positively related to Mflops
4th group: denormal floating point numbers
• Used to represent extremely small numbers
• Very slow on CPU, full-speed support on GPU
• An advantage of GPU in sparse LU and scientific computing
• Very high speedups for this group
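
To make "denormal" concrete, the short sketch below shows where denormal (subnormal) values sit for IEEE 754 double precision. It only illustrates the number format; it does not measure the CPU slowdown reported above, and `is_denormal` is a helper written for this example.

```python
import sys

smallest_normal = sys.float_info.min   # 2**-1022, smallest normal double
denormal = smallest_normal / 4         # below the normal range, still nonzero
smallest_denormal = 5e-324             # 2**-1074, smallest positive double


def is_denormal(x):
    """True if x is nonzero but smaller in magnitude than any normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min
```

Values in this range keep tiny results from collapsing straight to zero, at the cost of reduced precision.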


Sparse LU factorization - Experimental results

Average speedup of each group

Group   GPU bandwidth (GB/s)   Over 1 CPU   Over 4 CPUs   Over 8 CPUs   Over KLU
1        0.81                   0.41         0.24          0.22          0.58
2       10.97                   2.43         0.85          0.55          3.64
3       52.59                  10.53         3.65          2.58         15.58
4       36.82                  26.86         8.01          4.48         25.61
All     15.91                   4.51         1.64          1.13          6.25


Scalability – BBD Problem
How to use multiple GPUs?
• Circuit-partition-based simulation algorithm
• Bordered-block-diagonal (BBD)
• Diagonal blocks are factorized independently
• But An becomes dense, so we need dense LU factorization


Dense LU Factorization – blocked algorithm
Three core operations
• Dense LU factorization
• Triangular matrix inversion
• Matrix multiplication
Suitable for GPU
• GEMM most frequent
• GEMM very efficient on GPU: 920 Gflop/s (single), 290 Gflop/s (double)

[Figure labels: finished, LU + inverse, GEMM]
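
The structure of a blocked LU factorization can be sketched in plain Python as below. Several simplifications are assumptions of this sketch rather than the slides' method: there is no pivoting, the matrix is a list of lists, and triangular solves are used where the slides list explicit triangular matrix inversion. The name `blocked_lu` and the block size parameter `bs` are hypothetical.

```python
def blocked_lu(a, bs):
    """In-place blocked LU without pivoting.

    On return, a holds U on and above the diagonal and the strictly
    lower part of unit-diagonal L below it.
    """
    n = len(a)
    for k0 in range(0, n, bs):
        k1 = min(k0 + bs, n)
        # 1. Dense LU of the diagonal block: A11 = L11 * U11.
        for j in range(k0, k1):
            for i in range(j + 1, k1):
                a[i][j] /= a[j][j]               # assumes a nonzero pivot
                for c in range(j + 1, k1):
                    a[i][c] -= a[i][j] * a[j][c]
        # 2. U12 = inv(L11) * A12 by forward substitution.
        for j in range(k1, n):
            for i in range(k0, k1):
                for p in range(k0, i):
                    a[i][j] -= a[i][p] * a[p][j]
        # 3. L21 = A21 * inv(U11), one row at a time.
        for i in range(k1, n):
            for j in range(k0, k1):
                for p in range(k0, j):
                    a[i][j] -= a[i][p] * a[p][j]
                a[i][j] /= a[j][j]
        # 4. Trailing update A22 -= L21 * U12 (the GEMM step).
        for i in range(k1, n):
            for j in range(k1, n):
                for p in range(k0, k1):
                    a[i][j] -= a[i][p] * a[p][j]
    return a
```

Step 4 touches the whole trailing submatrix in every iteration, which is why the matrix multiplication dominates the operation count.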


Dense LU Factorization – performance

443 Gflop/s (single), 163 Gflop/s (double)


Dense LU Factorization – related works

Comparison to previous studies

Performance of dense LU factorization (Gflop/s)

Work            Hardware                          Single   Double
[Galoppo2005]   GTX 7800                           10       --
[Volkov2008]    GTX 8800                          179       --
[Tomov2010]     8 Xeon Harpertown                 100       50
[Tomov2010]     GTX 280                           300       --
[Tomov2010]     8 Xeon Harpertown + GTX 280       388       99
Ours            Radeon 5870                       443      163


Dense LU Factorization – further improvement
CPU BLAS for Gaussian elimination 100 Gflop/s
GEMM can be further improved
Scalability to multiple GPUs
• Blocked dense LU: independent GEMMs on multiple GPUs
• Diagonal blocks in BBD on multiple GPUs
• Linear performance improvement expected


Summary
First work on GPU sparse LU factorization
• Exploit parallelism of left-looking algorithm
Blocked dense LU factorization
• 443 Gflop/s (single), 163 Gflop/s (double)
Supplement to OpenCL BLAS
Accelerate SPICE simulators


Reference
[SPICE] L. W. Nagel, “SPICE 2: A computer program to simulate semiconductor circuits,” Ph.D. dissertation, University of California, Berkeley, 1975.
[SuperLU 1999] J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, and J. W. H. Liu, “A supernodal approach to sparse partial pivoting,” SIAM J. Matrix Analysis and Applications, vol. 20, no. 3, pp. 720–755, 1999.
[Pardiso 2002] O. Schenk and K. Gartner, “Solving unsymmetric sparse systems of linear equations with PARDISO,” Computational Science - ICCS 2002, vol. 2330, pp. 355–363, 2002.
[G/P 1988] J. R. Gilbert and T. Peierls, “Sparse partial pivoting in time proportional to arithmetic operations,” SIAM J. Sci. Statist. Comput., vol. 9, pp. 862–874, 1988.
[KLU 2010] T. A. Davis and E. Palamadai Natarajan, “Algorithm 907: KLU, a direct sparse solver for circuit simulation problems,” ACM Trans. Math. Softw., vol. 37, pp. 36:1–36:17, September 2010.


[Christen2007] M. Christen, O. Schenk, and H. Burkhart, “General-purpose sparse matrix building blocks using the NVIDIA CUDA technology platform,” 2007.
[Davis] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” to appear in ACM Transactions on Mathematical Software.
[Galoppo2005] N. Galoppo, N. K. Govindaraju, M. Henson, and D. Manocha, “LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware,” SC Conference, vol. 0, p. 3, 2005.
[Volkov2008] V. Volkov and J. Demmel, “LU, QR and Cholesky factorizations using vector capabilities of GPUs,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2008-49, May 2008.
[Tomov2010] S. Tomov, J. Dongarra, and M. Baboulin, “Towards dense linear algebra for hybrid GPU accelerated manycore systems,” Parallel Comput., vol. 36, pp. 232–240, June 2010.


[MC64] I. S. Duff and J. Koster, “The design and use of algorithms for permuting large entries to the diagonal of sparse matrices,” SIAM J. Matrix Anal. and Applics, no. 4, pp. 889–901, 1997.
[AMD] P. R. Amestoy, T. A. Davis, and I. S. Duff, “Algorithm 837: AMD, an approximate minimum degree ordering algorithm,” ACM Trans. Math. Softw., vol. 30, pp. 381–388, September 2004.


Thank you!


Sparse LU factorization – Terminology
Elimination Graph Definition
• An edge from j to k iff U(j, k) != 0
• In the following context, node = column
Level Definition
• The level of a node is the length of the longest path from any source node to that node
• Source nodes have no incoming edges
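
The level definition can equivalently be written as a recurrence over the columns. The notation $\mathrm{level}(k)$ is introduced here for illustration, and the condition $j < k$ makes explicit the assumption that only off-diagonal entries of U create edges.

```latex
\[
\mathrm{level}(k) =
\begin{cases}
0, & \text{if } k \text{ is a source node},\\[4pt]
1 + \max\limits_{\substack{j < k \\ U(j,k) \neq 0}} \mathrm{level}(j), & \text{otherwise.}
\end{cases}
\]
```

Because U is upper triangular, every edge goes from a smaller column index to a larger one, so the levels can be computed in a single left-to-right pass over the columns.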


Sparse LU factorization - Experimental results


Dense LU factorization – Basic algorithm

Factorize to get and

Blocked LU factorization


Dense LU factorization – Basic algorithm

Repeat the process to obtain , , and so on
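
For reference, one step of the standard blocked LU factorization can be written as follows. The block names $A_{11}$, $A_{12}$, $A_{21}$, $A_{22}$ and the corresponding factors are this sketch's notation and may differ from the symbols on the original slides.

```latex
\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix}
\]
\begin{align*}
A_{11} &= L_{11} U_{11} && \text{(dense LU of the diagonal block)}\\
U_{12} &= L_{11}^{-1} A_{12}, \qquad L_{21} = A_{21} U_{11}^{-1} && \text{(triangular inverses)}\\
L_{22} U_{22} &= A_{22} - L_{21} U_{12} && \text{(matrix multiplication, then recurse)}
\end{align*}
```

The last line has the same form as the original problem on a smaller matrix, which is what allows the process to be repeated block by block.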