
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware

Nico Galoppo, Naga K. Govindaraju, Michael Henson, Dinesh Manocha

http://gamma.cs.unc.edu/LU-GPU


Goals

Demonstrate advantages of mapping linear algebra routines to graphics hardware:

Performance

Growth rate

LAPACK compliant set of linear algebra algorithms on graphics hardware


Outline

LU Decomposition & Related Work

The potential of GPUs

LU-GPU algorithm

Results

Conclusions & Ongoing Work


LU decomposition

Sequence of row eliminations. Scale and add: A(i,j) = A(i,j) - A(i,k) * A(k,j), where the multiplier A(i,k) = A(i,k) / A(k,k) has been stored in place (see the sketch below)

Input data mapping: 2 distinct memory regions

No data dependencies within a row elimination

Pivoting: pointer-swap vs. data copy

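To make the update rule concrete, here is a minimal NumPy sketch (an illustration, not the GPU shader code) of in-place LU by row elimination, without pivoting. The multiplier is stored where the eliminated entry was, so the slide's scale-and-add applies directly; every element of the trailing submatrix updates independently, and that independence is the parallelism the GPU exploits.

```python
import numpy as np

def lu_inplace(A):
    """In-place LU without pivoting; A must be a float ndarray.
    On return, A holds the unit-lower L (below the diagonal)
    and U (on and above the diagonal) packed together."""
    n = A.shape[0]
    for k in range(n - 1):                    # one row-elimination step
        A[k+1:, k] /= A[k, k]                 # scale: store multipliers A(i,k)
        # scale and add: A(i,j) = A(i,j) - A(i,k) * A(k,j) for all i, j > k;
        # every element is independent, so the whole block updates in parallel
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return A
```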


LU decomposition

Theoretical complexity (partial pivoting): (2/3)n³ + O(n²) floating-point operations

Performance ↔ architecture:

Order of operations

Memory access (latency)

Memory bandwidth
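For scale: at n = 3500 the count is about (2/3) · 3500³ ≈ 2.9 × 10¹⁰ operations, so a CPU sustaining the 2.88 GFLOP/s LINPACK rate quoted below would need roughly 10 s, the magnitude seen later in the Results charts.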



Commodity CPUs

LINPACK Benchmark:

Intel Pentium 4, 3.06 GHz: 2.88 GFLOP/s

(Jack Dongarra, Oct 2005)


Streaming architectures

Specialized hardware: high bandwidth/compute ratio

Merrimac [Erez04]: molecular modeling at 38 GFLOP/s vs. 2.7 GFLOP/s on a P4; about $1,000/node

Imagine [Ahn04]: 10.46 GFLOP/s on QR decomposition

Research prototypes, not commodity hardware



CPU vs. GPU

Pentium EE 840: 3.2 GHz dual core, 230M transistors, 90 nm process, 206 mm², 2 x 1 MB cache, 25.6 GFLOP/s. Price: $1,040

GeForce 7800 GTX: 430 MHz, 302M transistors, 110 nm process, 326 mm², 512 MB onboard memory, 313 GFLOP/s (shader), 1.3 TFLOP/s (total). Price: $450


CPU vs. GPU (Henry Moreton, NVIDIA, Aug. 2005)

                          PEE 840   7800 GTX   GPU/CPU
Graphics GFLOP/s             25.6       1300      50.8
Shader GFLOP/s               25.6        313      12.2
Die area (mm²)                206        326       1.6
Die area (normalized)         206        218       1.1
Transistors (M)               230        302       1.3
Power (W)                     130         65       0.5
GFLOP/s per mm²               0.1        6.0      47.9
GFLOP/s per Mtransistors      0.1        4.3      38.7
GFLOP/s per W                 0.2       20.0     101.6


CPU vs. GPU: Bandwidth

[Diagram: CPU (3 GHz, 2 x 1 MB cache) ↔ system memory (2+ GB, with 512 MB AGP-mapped) at 6.4 GB/s; GPU (500 MHz) ↔ video memory (512 MB) at 35.2 GB/s; the two sides are linked by a PCI-E bus at 4 GB/s.]


Bandwidth

Large, high-bandwidth memory: 512 MB video memory vs. 2 MB L2 cache on CPUs

High memory-to-compute clock ratio reduces memory stalls


Graphics pipeline

[Pipeline diagram: vertex → polygon → pixel → texture → image data flowing through the stages below, all backed by memory:

programmable vertex processing (fp32)

polygon setup, culling, rasterization

programmable per-pixel math (fp32)

per-pixel texture, fp16 blending

Z-buffer, fp16 blending, anti-aliasing (MRT)]


Stream processor (non-graphics) (David Kirk, NVIDIA, May 2005)

[Diagram: the same pipeline relabeled as a general-purpose stream processor, all stages backed by memory:

programmable MIMD processing (fp32)

setup / rasterizer driving SIMD "rasterization" over lists

programmable SIMD processing (fp32)

data fetch, fp16 blending

predicated write, fp16 blend, multiple outputs]


Potential of graphics processors

Commodity horsepower: parallel computation, bandwidth

Programmable graphics pipeline: a stream processor

Exploit large growth rate


Exploiting technology moving faster than Moore's law (source: Anselmo Lastra)

[Chart: CPU growth rate vs. GPU growth rate over time.]


General purpose computing on GPUs

Physical Simulation: Fluid Flow [Fan et al. 2004]

FEM [Rumpf and Strzodka 2001]

Cloud Dynamics [Harris et al. 2003]

Sparse Linear Algebra: Operators [Krüger and Westermann 2003]

Iterative Solvers [Bolz et al. 2003]


General purpose computing on GPUs

Matrix-Matrix Multiplication: fixed graphics pipeline, fixed-point arithmetic [Larsen and McAllister 2001]

Floating-point (SP) [Fatahalian et al. 2004]

High-level APIs: BrookGPU [Buck et al. 2004]

Sh [McCool et al. 2004]



Motivation for LU-GPU

LU decomposition maps well: stream program

Few data dependencies

Pivoting: parallel pivot search

Exploit large memory bandwidth


GPU-based algorithms

Data representation

Algorithm mapping


Data representation

Matrix elements → 2D texture memory

One-to-one mapping

Texture memory = on-board memory: exploit bandwidth

Avoid CPU-GPU data transfer


GPU-based algorithms

Data representation

Algorithm mapping:

Stream computation

Input data mapping

Fast row swaps


Stream computation

Rasterize quadrilaterals:

Generates computation stream

Invokes SIMD units

Rasterization simulates blocking

Rasterization pass = row elimination

Alternating memory regions
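GPUs of this generation could not read and write the same texture in one pass, so "alternating memory regions" means ping-pong buffering: each elimination pass reads the matrix state from one buffer and writes the result to the other, then the roles swap. A hedged NumPy sketch of the buffer discipline only (no pivoting; names are illustrative):

```python
import numpy as np

def lu_ping_pong(A):
    """Sketch: each pass reads src and writes dst, mimicking the two
    alternating texture memory regions on the GPU."""
    n = A.shape[0]
    src = A.astype(np.float32)      # "texture" read in this pass
    dst = np.empty_like(src)        # "texture" written in this pass
    for k in range(n - 1):
        dst[:] = src                # pass the untouched region through
        dst[k+1:, k] = src[k+1:, k] / src[k, k]   # store multipliers
        dst[k+1:, k+1:] = src[k+1:, k+1:] - np.outer(dst[k+1:, k], src[k, k+1:])
        src, dst = dst, src         # swap roles: the ping-pong
    return src                      # packed L and U, as before
```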


Input data mapping

Dedicated texture mapping hardware

Traditionally for color interpolation

Map input matrix elements to output elements

Eliminates computation of memory locations

25% performance improvement
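Why texture mapping can remove the address arithmetic: for the step-k update, the two inputs needed at output pixel (i, j) are A[i, k] and A[k, j], and both are affine functions of (i, j). Assigning them as texture coordinates at the quad's corners lets the rasterizer's interpolators generate the source addresses per pixel. A small illustrative sketch (the index convention is my assumption, not the paper's exact layout):

```python
k, n = 3, 8   # illustrative step index and matrix size

# The update quad covers output pixels (i, j) with k+1 <= i, j <= n-1.
# Each pixel needs A[i, k] (multiplier) and A[k, j] (pivot row); both are
# affine in (i, j), so per-corner texture coordinates suffice and the
# hardware interpolates them -- no address math in the fragment program.
corners = [(k + 1, k + 1), (n - 1, k + 1), (k + 1, n - 1), (n - 1, n - 1)]
tex_multiplier = [(i, k) for (i, j) in corners]   # reads A[i, k]
tex_pivot_row  = [(k, j) for (i, j) in corners]   # reads A[k, j]
```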


Pivoting

Main issues:

Pivot search

Row/column swapping


Partial pivoting

Fast row swap: data copy via mapped rasterization

Texture mapping hardware

High memory bandwidth

Improvement over pointer swapping

[Diagram: texture mapping hardware streaming an input row to an output row.]
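On the GPU, rows live inside a texture, so there is no row pointer to exchange; the swap is executed as a copy pass through the texture-mapped rasterizer, which runs at full video-memory bandwidth. Functionally it is just the copy below (a hedged NumPy stand-in for the rasterization passes):

```python
import numpy as np

def swap_rows_by_copy(A, r, p):
    """Swap rows r and p by copying data, as the GPU's mapped
    rasterization does, rather than by exchanging pointers."""
    tmp = A[r].copy()
    A[r] = A[p]
    A[p] = tmp
```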


Full pivoting

Fast column/row swap

Parallel pivot search: divide-and-conquer approach (see the sketch below)

[Diagram: pivot search region for partial pivoting (current column) vs. full pivoting (entire trailing submatrix).]
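The parallel pivot search is a divide-and-conquer max-reduction: each pass compares pairs of candidates and keeps the larger magnitude, so the pivot emerges in O(log n) data-parallel passes instead of a serial scan. A hedged NumPy sketch (function name and padding scheme are illustrative):

```python
import numpy as np

def parallel_pivot_search(col):
    """Return the index of the largest |entry| via log-depth pairwise
    reduction. Usage for step k: p = k + parallel_pivot_search(A[k:, k])."""
    idx = np.arange(len(col))
    vals = np.abs(col).astype(np.float32)
    while len(vals) > 1:
        half = (len(vals) + 1) // 2
        lo, hi = vals[:half], vals[half:]
        lo_i, hi_i = idx[:half], idx[half:]
        if len(hi) < len(lo):             # pad the tail when length is odd
            hi = np.append(hi, -np.inf)
            hi_i = np.append(hi_i, -1)
        take_hi = hi > lo                 # one data-parallel comparison pass
        vals = np.where(take_hi, hi, lo)
        idx = np.where(take_hi, hi_i, lo_i)
    return int(idx[0])
```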



Benchmarks

GPU          SIMD units   Core clock   Memory   Memory clock
6800 GT              12      350 MHz   256 MB        900 MHz
6800 Ultra           16      425 MHz   256 MB       1100 MHz
7800 GTX             24      430 MHz   256 MB       1200 MHz

Commodity CPU: 3.4 GHz Pentium IV with Hyper-Threading, 1 MB L2 cache

LAPACK sgetrf() (blocked algorithm, ATLAS library)

LAPACK sgetc2() (SSE-optimized IMKL library)
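Not from the talk, but for reference: the CPU baseline routine is exposed today through SciPy, whose lu_factor wraps LAPACK getrf; a float32 input selects the single-precision sgetrf used in the benchmark (the underlying BLAS need not be ATLAS). A minimal sketch:

```python
import numpy as np
from scipy.linalg import lu_factor

n = 2048
A = np.random.rand(n, n).astype(np.float32)  # single precision, as on the GPU
lu, piv = lu_factor(A)   # blocked, partially pivoted LU via LAPACK sgetrf
```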


Results: No pivoting

[Chart: time (s) vs. matrix size N, 1000-3500. Series: ATLAS GETRF (partial pivot), Ultra 6800 LU (no pivoting), GT 6800 LU (no pivoting), 7800 LU (no pivoting).]


Results: Partial pivoting

[Chart: time (s) vs. matrix size N, 500-3500. Series: ATLAS GETRF (partial pivot), GT 6800 partial pivot, Ultra 6800 partial pivot, 7800 partial pivot.]


Results: Full Pivoting

[Chart: time (s) vs. matrix size N, 500-3500. Series: LAPACK sgetc2 (IMKL), Ultra 6800 full pivot, 7800 full pivot.]


Results: Number of computational units

[Chart: time (s) vs. matrix size N, 500-4000, on the 6800 Ultra (no pivoting) with 4, 8, 12, and 16 SIMD units enabled; annotated (Jun 2003) and (Mar 2004).]


GPU-CPU data transfer overhead

[Chart: time (s) vs. matrix size N, 500-3500. Series: GT 6800 partial pivot, Ultra 6800 partial pivot, 7800 partial pivot, and CPU-GPU transfer time.]


Bandwidth efficiency

[Chart: bandwidth usage (GB/s) vs. matrix size N, 500-4000, for the 6800 Ultra and 6800 GT. Peak bandwidth: 35.2 GB/s (6800 Ultra), 28.8 GB/s (6800 GT).]


Faster than Moore’s law

[Chart: time (s) vs. matrix size N, 500-3500. Series: ATLAS GETRF (partial pivot), GT 6800 partial pivot, Ultra 6800 partial pivot, 7800 partial pivot; annotated with hardware dates (Mar 2004, Jun 2005).]


Application: Fluid Simulation

Solve parallel sub-problems (N = 2048)

Diagonally dominant

No pivoting required

15% faster than ATLAS on a Pentium IV 3.06 GHz


Limitations

Maximum matrix size: 4096 x 4096; larger problems need block-partitioned LU decomposition (sketched after this list)

Precision: single-precision floating point, not 100% IEEE compliant

CPU-GPU data transfer overhead dominates for small matrices
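A minimal sketch of the standard block-partitioned (right-looking) LU the slide alludes to, assuming no pivoting for brevity: factor a b-wide diagonal block, apply triangular solves to the adjoining panels, then a rank-b update of the trailing submatrix, so no single kernel exceeds the block size. blocked_lu is an illustrative name and lu_inplace is the sketch from the LU decomposition slide.

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu(A, b=4096):
    """Block-partitioned LU without pivoting; b matches the texture limit."""
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        lu_inplace(A[k:e, k:e])            # factor the diagonal block
        L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        U11 = np.triu(A[k:e, k:e])
        if e < n:
            # panel solves: A12 <- L11^-1 A12 and A21 <- A21 U11^-1
            A[k:e, e:] = solve_triangular(L11, A[k:e, e:], lower=True)
            A[e:, k:e] = solve_triangular(U11.T, A[e:, k:e].T, lower=True).T
            # rank-b trailing update
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A
```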


Graphics hardware advancements

Improved floating-point bandwidth: 4-component vs. single-component

Floating-point blending: use of the non-programmable TFLOPs



Conclusions

Algorithm mapped to graphics pipeline

Novel mapping of row operations to rasterization

Stream computation

Blocking


Conclusions

Optimized for the GPU architecture:

Input data mapping

Fast pivoting


Conclusions

Performance comparable to industry-standard libraries

Relatively small development effort

GPUs are useful co-processors for scientific computation

Many applications


Conclusions

LU-GPU open-source library available:

http://gamma.cs.unc.edu/LUGPULIB/


Ongoing work

Sorting on GPUs

Linear algebra: GPU-LAPACK / QR decomposition


Sorting on GPUs

Goal: utilize the high parallelism and memory bandwidth of GPUs for fast sorting

[Govindaraju et al., SIGMOD 2005]

GPUSort: open-source library (http://gamma.cs.unc.edu/GPUSORT/)


Sorting on GPUs

6 times faster than Quicksort on a 3.4 GHz Pentium IV PC!


Linear algebra

LAPACK-compliant library for GPUs

QR-decomposition in development (LAPACK SGEQRF)


Acknowledgements

Army Research Office, DARPA, National Science Foundation, Office of Naval Research, RDECOM, Intel Corporation, NVIDIA Corporation, UNC GAMMA Group


Thank you

For questions or comments:

nico@cs.unc.edu, naga@cs.unc.edu, henson@cs.unc.edu

http://gamma.cs.unc.edu/
http://gamma.cs.unc.edu/LUGPULIB/
