IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Computational Physics (Kipton Barros, BU)

DESCRIPTION

More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009. Note that some slides were borrowed from NVIDIA.

TRANSCRIPT

Page 1:

CUDA Tricks and Computational Physics

Kipton Barros

In collaboration with R. Babich, R. Brower, M. Clark, C. Rebbi, J. Ellowitz

Boston University

Page 2:

High energy physics: huge computational needs

27 km

Large Hadron Collider, CERN

Page 3:

A disclaimer:

I’m not a high energy physicist

A request:

Please question/comment freely during the talk

Page 4:

View of the CMS detector at the end of 2007. (Maximilien Brice, © CERN)

Page 5:

View of the Computer Center during the installation of servers. (Maximilien Brice; Claudia Marcelloni, © CERN)

15 Petabytes to be processed annually

Page 6:

The “Standard Model” of Particle Physics

Page 7:

I’ll discuss Quantum ChromoDynamics

Although it’s “standard”, these equations are hard to solve

Big questions: Why do quarks appear in groups? What was the physics during the big bang?

Page 8:
Page 9:

Quantum ChromoDynamics: the theory of nuclear interactions (quarks bound by “gluons”)

Extremely difficult:

Must work at the level of fields, not particles

Calculation is quantum mechanical

Page 10:

Lattice QCD: Solving Quantum Chromodynamics by Computer

Discretize space and time (place the quarks and gluons on a 4D lattice)

Page 11:

Spacetime = 3+1 dimensions

32⁴ ≈ 10⁶ lattice sites

Quarks live on sites (24 floats each)

Gluons live on links (18 floats each)

Total system size: 4 bytes/float × 32⁴ lattice sites × (24 + 4 × 18) floats per site (quarks + gluons) ≈ 384 MB
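For reference, the arithmetic behind that estimate: 32⁴ ≈ 1.0 × 10⁶ sites, each carrying 24 + 4 × 18 = 96 floats, so 4 bytes × 32⁴ × 96 ≈ 4.0 × 10⁸ bytes ≈ 384 MB.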

Page 12:

Lattice QCD: The inner loop requires repeatedly solving a linear equation involving D_W (acting on the quarks, with coefficients given by the gluons)

D_W is a sparse matrix with only nearest-neighbor couplings

Applying D_W needs to be fast!

Page 13:

Operation of D_W:

1 output quark site (24 floats)

Page 14:

Operation of D_W:

1 output quark site (24 floats)

2x4 input quark sites (24x8 floats)

Page 15:

Operation of D_W:

1 output quark site (24 floats)

2x4 input quark sites (24x8 floats)

2x4 input gluon links (18x8 floats)

Page 16:

Operation of D_W:

1 output quark site (24 floats)

2x4 input quark sites (24x8 floats)

2x4 input gluon links (18x8 floats)

1.4 kB of local storage required per quark update?
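For reference, that figure follows from the counts above: 24 + 8 × 24 + 8 × 18 = 360 floats, and 360 × 4 bytes = 1440 bytes ≈ 1.4 kB.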

Page 17:

CUDA Parallelization: Must process many quark updates simultaneously

Odd/even sites processed separately

Page 18:

© NVIDIA Corporation 2006

Programming Model

A kernel is executed as a grid of thread blocks

A thread block is a batch of threads that can cooperate with each other by:

Sharing data through shared memory

Synchronizing their execution

Threads from different blocks cannot cooperate

[Diagram: the Host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the Device; Grid 1 contains Blocks (0,0) through (2,1); one block, Block (1,1), is expanded to show Threads (0,0) through (4,2)]
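To make the model above concrete, here is a minimal sketch (not from the original slides; the kernel name and sizes are made up) of a kernel launched as a grid of blocks whose threads cooperate through shared memory and __syncthreads():

    // Sketch of the programming model: a kernel runs as a grid of blocks;
    // threads within one block share data via shared memory and synchronize.
    __global__ void block_sum(const float *in, float *out)
    {
        __shared__ float buf[256];                 // visible to this block only
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;

        buf[tid] = in[gid];                        // each thread loads one value
        __syncthreads();                           // barrier for the whole block

        // Cooperative tree reduction inside the block. Threads in *different*
        // blocks cannot cooperate like this.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) buf[tid] += buf[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = buf[0];    // one result per block
    }

    // Host side: launch as a grid of thread blocks, 256 threads per block.
    // block_sum<<<numBlocks, 256>>>(d_in, d_out);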


Page 19:

D_W parallelization:

Each thread processes 1 site

No communication required between threads!

All threads in a warp execute the same code

Page 20:

Step 1: Read neighbor site

Page 21:

Step 1: Read neighbor site

Step 2: Read neighbor link

Page 22:

Step 1: Read neighbor site

Step 2: Read neighbor link

Step 3: Accumulate into the output

Page 23:

Step 1: Read neighbor site

Step 2: Read neighbor link

Step 3: Accumulate into the output

Step 4: Read neighbor site

Page 24:

Step 1: Read neighbor site

Step 2: Read neighbor link

Step 3: Accumulate into the output

Step 4: Read neighbor site

Step 5: Read neighbor link

Page 25:

Step 1: Read neighbor site

Step 2: Read neighbor link

Step 3: Accumulate into the output

Step 4: Read neighbor site

Step 5: Read neighbor link

Step 6: Accumulate into the output

Page 26:

Occupancy

Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy

Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently

Limited by resource usage:

Registers

Shared memory

Page 27:

Optimizing threads per block

Choose threads per block as a multiple of warp size

Avoid wasting computation on under-populated warps

More threads per block == better memory latency hiding

But, more threads per block == fewer registers per thread

Kernel invocations can fail if too many registers are used

Heuristics

Minimum: 64 threads per block

Only if multiple concurrent blocks

192 or 256 threads a better choice

Usually still enough regs to compile and invoke successfully

This all depends on your computation, so experiment!

Page 28:

Reminder -- each multiprocessor has:

16 kB shared memory

16 k registers

1024 active threads (max)

High occupancy needed for maximum performance (roughly 25% or so)

Page 29:

D_W: does it fit onto the GPU?

Each thread requires 0.2 kB (not 1.4 kB) of fast local memory

[Figure: per-thread data: 24 floats, 12 floats, 18 floats, 24 floats]

Page 30:

D_W: does it fit onto the GPU?

Each thread requires 0.2 kB (not 1.4 kB) of fast local memory

MP has 16 kB shared mem

Threads/MP = 16 / 0.2 = 80

Page 31:

D_W: does it fit onto the GPU?

Each thread requires 0.2 kB (not 1.4 kB) of fast local memory

MP has 16 kB shared mem

Threads/MP = 16 / 0.2 = 80 → 64 (multiple of 64 only)

Page 32:

D_W: does it fit onto the GPU?

Each thread requires 0.2 kB (not 1.4 kB) of fast local memory

MP has 16 kB shared mem

Threads/MP = 16 / 0.2 = 80 → 64 (multiple of 64 only)

MP occupancy = 64/1024 = 6%

Page 33:

6% occupancy sounds pretty bad!

Andreas Kuehn / Getty

Page 34:

Reminder -- each multiprocessor has:

16 kB shared memory

16 k registers

1024 active threads (max)

Each thread requires 0.2 kB of fast local memory

How can we get better occupancy?

Page 35:

Reminder -- each multiprocessor has:

16 kB shared memory

16 k registers = 64 kB of memory

1024 active threads

Each thread requires 0.2 kB of fast local memory

How can we get better occupancy?

Occupancy > 25%

Page 36:

Registers as data (possible because there is no inter-thread communication)

Instead of shared memory

Registers are allocated as individual (named) variables

Page 37:

Registers as data

Can’t be indexed. All loops must be EXPLICITLY expanded

Page 38:

Code sample

(approx. 1000 LOC automatically generated)
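The code shown on this slide is an image and is not reproduced in the transcript. As a rough, hypothetical sketch of the style being described (register-resident data, every component loop written out explicitly; all names invented here, not the authors' actual generated code):

    // Sketch only: data held in individually named registers, loops expanded.
    // The real generated kernel handles 24 quark floats and 18 link floats per
    // neighbor; only a couple of components are shown here.
    __global__ void dslash_sketch(const float *quark, const float *link,
                                  float *out, int numSites)
    {
        int site = blockIdx.x * blockDim.x + threadIdx.x;
        if (site >= numSites) return;

        // Accumulator in registers: no arrays, so nothing is dynamically indexed.
        float o0 = 0.0f, o1 = 0.0f;

        // "Structure of arrays" indexing (component * numSites + site);
        // see the coalescing discussion later in the talk.
        float q0 = quark[0 * numSites + site];
        float q1 = quark[1 * numSites + site];
        float u0 = link[0 * numSites + site];
        float u1 = link[1 * numSites + site];

        // Explicitly expanded arithmetic (here: one complex multiply-add).
        o0 += u0 * q0 - u1 * q1;
        o1 += u0 * q1 + u1 * q0;
        // ... roughly a thousand machine-generated lines of this form ...

        out[0 * numSites + site] = o0;
        out[1 * numSites + site] = o1;
    }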

Page 39:

Performance Results:

82 Gigabytes/sec (GTX 280)

44 Gigabytes/sec (Tesla C870)

(completely bandwidth limited)

For comparison:

twice as fast as Cell impl. (arXiv:0804.3654)

20 times faster than CPU implementations

(90 Gflop/s)

Page 40:

[Charts: GB/s vs. occupancy. Tesla C870: y-axis 0–45 GB/s, occupancy ≥ 25%, 17%, 8%, 0%. GTX 280: y-axis 0–85 GB/s, occupancy ≥ 19%, 13%, 6%, 0%.]

Surprise! Very robust to low occupancy

Page 41:

Device memory is the bottleneck. Coalesced memory accesses are crucial.

Data reordering:

[Diagram: quark components stored per quark (Quark 1: q1_1, q1_2, ..., q1_24; Quark 2: q2_1, ...; Quark 3: ...) are reordered so that the first components of all quarks are adjacent (q1_1, q2_1, q3_1, ..., then q1_2, q2_2, q3_2, ...), matching thread 0, thread 1, thread 2, ...]
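A small sketch of the indexing change implied by the diagram (my reading; array and variable names are invented): store quark components "structure of arrays" style so that consecutive threads, which handle consecutive sites, read consecutive addresses.

    // Array-of-structures: quark[site * 24 + c]
    //   -> neighboring threads read addresses 24 floats apart (poor coalescing).
    // Structure-of-arrays: quark[c * numSites + site]
    //   -> neighboring threads read consecutive addresses (coalesced).
    __device__ float load_quark_component(const float *quark,
                                          int site, int c, int numSites)
    {
        return quark[c * numSites + site];
    }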

Page 42:

Memory coalescing: store even/odd lattices separately

Page 43:

When memory access isn't perfectly coalesced

Sometimes float4 arrays can hide latency

This global memory read corresponds to a single CUDA instruction (one float4 per thread: thread 0, thread 1, thread 2, ...)

In case of a coalesce miss, at least 4x the data is transferred
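For instance (a sketch with invented names), declaring the storage as float4 means each thread's read below is a single 16-byte load instruction:

    // One float4 load = a single instruction moving 16 bytes per thread.
    __global__ void sum_float4(const float4 *data, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float4 v = data[i];                     // single 16-byte global read
        out[i] = v.x + v.y + v.z + v.w;         // placeholder use of the values
    }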

Page 44:

When memory access isn't perfectly coalesced

Binding to textures can help

This makes use of the texture cache and can reduce the penalty for nearly coalesced accesses

(the texture fetch corresponds to a single CUDA instruction)
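With the texture-reference API of that era (CUDA 2.x; since removed from CUDA), binding the linear array to a 1D texture and reading it through the texture cache looked roughly like this sketch (names invented):

    // Legacy texture-reference API, roughly as it existed around CUDA 2.x.
    texture<float4, 1, cudaReadModeElementType> quarkTex;

    __global__ void sum_via_texture(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float4 v = tex1Dfetch(quarkTex, i);     // fetched through the texture cache
        out[i] = v.x + v.y + v.z + v.w;
    }

    // Host side:
    //   cudaBindTexture(0, quarkTex, d_quarks, n * sizeof(float4));
    //   sum_via_texture<<<grid, block>>>(d_out, n);
    //   cudaUnbindTexture(quarkTex);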

Page 45:

Regarding textures, there are two kinds of memory:

Linear array:

Can be modified in kernel

Can only be bound to a 1D texture

“CUDA array”:

Can't be modified in kernel

Gets reordered for 2D, 3D locality

Allows various hardware features
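A sketch of the "CUDA array" side for contrast (legacy runtime calls of that era; names invented): the array is opaque, cannot be written from a kernel, and is bound to a 2D texture.

    // "CUDA array" bound to a 2D texture (legacy API of that era).
    texture<float, 2, cudaReadModeElementType> fieldTex;

    void bind_field_texture(const float *h_data, int width, int height)
    {
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
        cudaArray *arr = 0;
        cudaMallocArray(&arr, &desc, width, height);
        cudaMemcpyToArray(arr, 0, 0, h_data,
                          width * height * sizeof(float), cudaMemcpyHostToDevice);
        cudaBindTextureToArray(fieldTex, arr, desc);  // kernels read via tex2D()
        // The array cannot be modified from inside a kernel.
    }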

Page 46:

When a CUDA array is bound to a 2D texture, it is probably reordered to something like a Z-curve

[Wikipedia image: Z-order curve]

This gives 2D locality

Page 47:

Warnings:

The effectiveness of float4 and textures depends on the CUDA hardware and driver (!)

Certain “magic” access patterns are many times faster than others

Testing appears to be necessary

Page 48:

Memory bandwidth test

Should be optimal

Simple kernel

Memory access completely coalesced
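The kernel itself is not reproduced in the transcript; a minimal copy kernel of the kind being described might look like this (hypothetical):

    // Simple bandwidth test: one fully coalesced read and write per thread.
    __global__ void copy_kernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }
    // Effective bandwidth = (bytes read + bytes written) / time
    //                     = 2 * n * sizeof(float) / elapsed_seconds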

Page 49:

Memory bandwidth test

Simple kernel

Memory access completely coalesced

Bandwidth: 54 Gigabytes/sec (GTX 280, 140 GB/s theoretical!)

Page 50:

So why are NVIDIA samples so fast?

NVIDIA actually uses a modified access pattern (next slides): 102 Gigabytes/sec instead of 54 Gigabytes/sec

(GTX 280, 140 GB/s theoretical)

Page 51:

Naive access pattern

[Diagram: memory regions read by Block 1 and Block 2 at Step 1 and Step 2]

Page 52:

Modified access pattern

[Diagram: memory regions read by Block 1 and Block 2 at Step 1 and Step 2]

(much more efficient)
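The two diagrams are not captured in the transcript, so the following is only my reading of them: the naive version gives each block its own contiguous chunk to walk through, while the modified version has all blocks advance together with a grid-sized stride, so that at each step the whole grid touches one compact region of memory. A hedged sketch of that contrast:

    // Possible "naive" pattern: block b walks its own contiguous chunk.
    __global__ void copy_chunked(const float *in, float *out, int n, int perBlock)
    {
        int end = min(n, (blockIdx.x + 1) * perBlock);
        for (int i = blockIdx.x * perBlock + threadIdx.x; i < end; i += blockDim.x)
            out[i] = in[i];
    }

    // Possible "modified" pattern: all blocks advance with a grid-sized stride.
    __global__ void copy_strided(const float *in, float *out, int n)
    {
        int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            out[i] = in[i];
    }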

Page 53:

CUDA Compiler

CUDA C code → PTX code → CUDA machine code

(LOTS of optimization here)

Use the unofficial CUDA disassembler to view the CUDA machine code (the "CUDA disassembly")

Page 54:

CUDA Disassembler (decuda)

Compile foo.cu and save the cubin file

Disassemble
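The exact commands are not captured in the transcript, but the workflow is presumably along the lines of compiling to a cubin with something like "nvcc -cubin foo.cu" and then running the disassembler on the result, e.g. "decuda foo.cubin > foo.asm".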

Page 55:

Look how CUDA implements integer division!

Page 56:

CUDA provides fast (but imperfect) trigonometry in hardware!

Page 57:

The compiler is very aggressive in optimization. It will group memory loads together to minimize latency

Notice: each thread reads 20 floats!

(snippet from LQCD)