iap09 cuda@mit 6.963 - guest lecture: cuda tricks and high-performance computational physics (Kipton Barros)
DESCRIPTION
More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009. Note that some slides were borrowed from NVIDIA.
TRANSCRIPT
CUDA Tricks and Computational Physics
Kipton Barros
In collaboration with R. Babich, R. Brower, M. Clark, C. Rebbi, J. Ellowitz
Boston University
High energy physics: huge computational needs
27 km
Large Hadron Collider, CERN
A disclaimer:
I’m not a high energy physicist
A request:
Please question/comment freely during the talk
View of the CMS detector at the end of 2007. (Maximilien Brice, © CERN)
View of the Computer Center during the installation of servers. (Maximilien Brice; Claudia Marcelloni, © CERN)
15 Petabytes to be processed annually
The “Standard Model” of Particle Physics
I’ll discuss Quantum ChromoDynamics
Although it’s “standard”, these equations are hard to solve
Big questions: Why do quarks appear in groups? What was the physics during the big bang?
Quantum Chromodynamics
The theory of nuclear interactions (quarks bound by "gluons")
Extremely difficult:
Must work at the level of fields, not particles
Calculation is quantum mechanical
Lattice QCD: Solving Quantum Chromodynamics by Computer
Discretize space and time (place the quarks and gluons on a 4D lattice)
Spacetime = 3+1 dimensions
Quarks live on sites (24 floats each)
Gluons live on links (18 floats each)
Total system size (float = 4 bytes):
32^4 ≈ 10^6 lattice sites
4 bytes × 32^4 sites × (24 + 4×18) floats (quarks + gluons) ≈ 384 MB
Lattice QCD: the inner loop requires repeatedly solving a linear equation, D_W x = b
D_W is a sparse matrix with only nearest-neighbor couplings
The unknowns x are the quarks; the entries of D_W are built from the gluons
Applying D_W needs to be fast!
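The slides do not write the operator out; for reference, my reconstruction of the standard Wilson form of this nearest-neighbor operator (with κ the hopping parameter, γ_μ the Dirac matrices, U_μ the gluon links, and ψ the quark field) is:

```latex
(D_W \psi)(x) \;=\; \psi(x)
  \;-\; \kappa \sum_{\mu=1}^{4} \Big[ (1-\gamma_\mu)\, U_\mu(x)\, \psi(x+\hat\mu)
  \;+\; (1+\gamma_\mu)\, U_\mu^\dagger(x-\hat\mu)\, \psi(x-\hat\mu) \Big]
```

Each output site couples to its two neighbors in each of the 4 dimensions, which is the 8-neighbor stencil the following slides count floats for.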
Operation of D_W:
1 output quark site (24 floats)
2×4 input quark sites (24×8 floats)
2×4 input gluon links (18×8 floats)
1.4 kB of local storage required per quark update? ((24 + 24×8 + 18×8) floats × 4 bytes ≈ 1.4 kB)
CUDA Parallelization: Must process many quark updates simultaneously
Odd/even sites processed separately
© NVIDIA Corporation 2006
Programming Model
A kernel is executed as a grid of thread blocks
A thread block is a batch of threads that can cooperate with each other by:
Sharing data through shared memory
Synchronizing their execution
Threads from different blocks cannot cooperate
[Diagram: the host launches Kernel 1 and Kernel 2 on the device. Each kernel executes as a grid of thread blocks (Grid 1: Blocks (0,0)–(2,1)); each block, e.g. Block (1,1) of Grid 2, is a 2D array of threads (0,0)–(4,2).]
Threading
Friday, January 23, 2009
Parallelization:
Each thread processes 1 site
No communication required between threads!
All threads in a warp execute the same code
Operation of D_W, one neighbor at a time:
Step 1: Read neighbor site
Step 2: Read neighbor link
Step 3: Accumulate into output
Step 4: Read next neighbor site
Step 5: Read next neighbor link
Step 6: Accumulate into output
... and so on for all 8 neighbors
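A minimal per-thread sketch of that loop, with one thread per output site. The neighbor table, link layout, and the accumulate arithmetic are simplified placeholders (the real kernel does the full spin/color contraction), so treat this as an illustration of the access pattern only:

```cuda
// Sketch of the per-thread D_W stencil loop (placeholder arithmetic).
// Each thread owns one output quark site and walks its 8 neighbors.
__global__ void dslash_sketch(const float *quarks, const float *links,
                              const int *nbr,  // precomputed neighbor table, 8 per site
                              float *out, int n_sites)
{
    int site = blockIdx.x * blockDim.x + threadIdx.x;
    if (site >= n_sites) return;

    float acc[24];                              // output quark accumulator
    for (int c = 0; c < 24; c++) acc[c] = 0.0f;

    for (int dir = 0; dir < 8; dir++) {
        const float *q = quarks + 24 * nbr[8 * site + dir]; // read neighbor site
        const float *u = links  + 18 * (8 * site + dir);    // read neighbor link
        for (int c = 0; c < 24; c++)            // accumulate (placeholder for the
            acc[c] += u[c % 18] * q[c];         // real SU(3) x spinor product)
    }
    for (int c = 0; c < 24; c++) out[24 * site + c] = acc[c];
}
```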
Occupancy
Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy
Occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently
Limited by resource usage:
Registers
Shared memory
Optimizing threads per block
Choose threads per block as a multiple of warp size
Avoid wasting computation on under-populated warps
More threads per block == better memory latency hiding
But, more threads per block == fewer registers per thread
Kernel invocations can fail if too many registers are used
Heuristics
Minimum: 64 threads per block
Only if multiple concurrent blocks
192 or 256 threads a better choice
Usually still enough regs to compile and invoke successfully
This all depends on your computation, so experiment!
Reminder -- each multiprocessor has:
16 KB shared memory
16 K registers
1024 active threads (max)
High occupancy needed for maximum performance (roughly 25% or so)
D_W: does it fit onto the GPU?
Each thread requires 0.2 kB (down from 1.4 kB) of fast local memory: roughly 12 floats per neighbor quark (24 → 12 after spin projection), 18 floats for the link, and 24 floats for the output
MP has 16 KB shared mem
Threads/MP = 16 / 0.2 = 80 → 64 (threads are allocated in multiples of 64 only)
MP occupancy = 64/1024 = 6%
6% occupancy sounds pretty bad!
Reminder -- each multiprocessor has:
16 KB shared memory
16 K registers
1024 active threads (max)
Each thread requires 0.2 kB of fast local memory
How can we get better occupancy?
Reminder -- each multiprocessor has:
16 KB shared memory
16 K registers = 64 KB of memory
1024 active threads
Each thread requires 0.2 kB of fast local memory
How can we get better occupancy?
The register file is four times larger than shared memory: 64 KB / 0.2 kB = 320 threads, enough for occupancy > 25%
Registers as data (possible because there is no inter-thread communication)
Use registers instead of shared memory
Registers are allocated as local variables
Registers can't be indexed, so all loops must be EXPLICITLY expanded
Code sample (approx. 1000 LOC automatically generated)
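The generated code itself isn't reproduced in the transcript. A tiny hand-written illustration of the technique it uses (my sketch, with a 3-component vector standing in for the 24-float quark): name every value as a scalar so nothing is ever indexed, and the compiler can keep the whole working set in registers.

```cuda
// "Registers as data": the explicitly expanded loop uses only scalar
// locals, so nvcc can place them in registers instead of local memory.
__global__ void axpy_registers(const float *x, const float *y,
                               float a, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Explicitly expanded "loop": no arrays, no dynamic indexing.
    float x0 = x[3*i + 0], x1 = x[3*i + 1], x2 = x[3*i + 2];
    float y0 = y[3*i + 0], y1 = y[3*i + 1], y2 = y[3*i + 2];

    float r0 = a * x0 + y0;
    float r1 = a * x1 + y1;
    float r2 = a * x2 + y2;

    out[3*i + 0] = r0;
    out[3*i + 1] = r1;
    out[3*i + 2] = r2;
}
```

With 24 components per quark plus inputs, writing this by hand is impractical, which is why the roughly 1000-line kernel was generated automatically.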
Performance Results:
82 Gigabytes/sec (GTX 280)
44 Gigabytes/sec (Tesla C870)
(completely bandwidth limited)
For comparison:
twice as fast as Cell impl. (arXiv:0804.3654)
20 times faster than CPU implementations
(90 Gflops/s)
GB/s vs Occupancy
[Charts: Tesla C870, bandwidth 0–45 GB/s at occupancies ≥ 25%, 17%, 8%, 0%; GTX 280, bandwidth 0–85 GB/s at occupancies ≥ 19%, 13%, 6%, 0%]
Surprise! Very robust to low occupancy
Device memory is the bottleneck
Coalesced memory accesses crucial
Data reordering: instead of storing each quark contiguously (Quark 1: q1_1 ... q1_24, Quark 2: q2_1 ... q2_24, ...), interleave components (q1_1 q2_1 q3_1 ..., then q1_2 q2_2 q3_2 ...), so thread 0, thread 1, thread 2, ... read consecutive addresses
Memory coalescing: store even/odd lattices separately
When memory access isn't perfectly coalesced
Sometimes float4 arrays can hide latency
A float4 global memory read corresponds to a single CUDA instruction (thread 0, thread 1, thread 2, ...)
In case of a coalesce miss, at least 4x the data is transferred
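A minimal sketch of the float4 idea (assuming the array length is a multiple of 4 and the pointer is 16-byte aligned, which cudaMalloc guarantees):

```cuda
// float4 loads: one 16-byte memory instruction per thread instead of
// four 4-byte ones; fewer, wider transactions per warp.
__global__ void copy_float4(const float4 *src, float4 *dst, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)
        dst[i] = src[i];   // single 128-bit load + single 128-bit store
}
```

An existing float array can be reinterpreted with `reinterpret_cast<float4*>` as long as the alignment holds; misaligned pointers make the wide loads illegal, not just slow.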
When memory access isn't perfectly coalesced
Binding to textures can help
This makes use of the texture cache and can reduce the penalty for nearly coalesced accesses
A texture fetch also corresponds to a single CUDA instruction
Regarding textures, there are two kinds of memory:
Linear array
Can be modified in kernel
Can only be bound to a 1D texture
"CUDA array"
Can't be modified in kernel
Gets reordered for 2D, 3D locality
Allows various hardware features
When a CUDA array is bound to a 2D texture, it is probably reordered to something like a Z-curve
Wikipedia image
This gives 2D locality
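A sketch of binding a linear array to a 1D texture, using the era-appropriate texture-reference API (this API was later deprecated in CUDA 11 and removed in CUDA 12; the variable names are mine):

```cuda
// Texture reference must live at file scope (pre-CUDA-12 API).
texture<float4, 1, cudaReadModeElementType> quark_tex;

__global__ void read_via_texture(float4 *out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)
        out[i] = tex1Dfetch(quark_tex, i);  // read through the texture cache
}

// Host side (sketch):
//   float4 *d_quarks;  cudaMalloc(&d_quarks, bytes);
//   cudaBindTexture(0, quark_tex, d_quarks, bytes);
//   read_via_texture<<<blocks, threads>>>(d_out, n4);
//   cudaUnbindTexture(quark_tex);
```

Because this binds a linear array, the kernel may still write the underlying memory elsewhere; a "CUDA array" bound to a 2D/3D texture would instead be read-only inside the kernel, as the slide notes.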
Warnings:
The effectiveness of float4 and textures depends on the CUDA hardware and driver (!)
Certain “magic” access patterns are many times faster than others
Testing appears to be necessary
Memory bandwidth test
Simple kernel
Memory access completely coalesced
Should be optimal
Bandwidth: 54 Gigabytes / sec (GTX 280, 140 GB/s theoretical!)
So why are NVIDIA samples so fast?
NVIDIA actually uses a different access pattern: 102 Gigabytes / sec instead of 54
(GTX 280, 140 GB/s theoretical)
Naive access pattern
[Diagram: Block 1, Block 2, ... each walking its own region of memory over Step 1, Step 2, ...]
Modified access pattern
[Diagram: the same blocks and steps with a rearranged mapping of blocks to memory]
(much more efficient)
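The transcript doesn't preserve the exact code; one common realization of such a block-interleaved pattern from that era is a grid-stride loop, sketched here under that assumption:

```cuda
// Naive version: block b copies one large contiguous chunk on its own.
// Modified version (sketch): all blocks advance together in small steps,
// so at each step the whole grid touches one contiguous stretch of
// device memory instead of many far-apart regions.
__global__ void copy_grid_stride(const float *src, float *dst, int n)
{
    int stride = gridDim.x * blockDim.x;   // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];                   // one small step per iteration
}
```

On GT200-class hardware this kind of rearrangement also helped avoid DRAM partition hot-spots, which is consistent with the roughly 2x bandwidth gap the slide reports.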
CUDA Compiler
CUDA C code → PTX code → CUDA machine code
Use unofficial CUDA disassembler to view CUDA machine code
CUDA disassembly (LOTS of optimization here)
CUDA Disassembler (decuda)
Compile and save the cubin file (foo.cu)
Disassemble
Look how CUDA implements integer division!
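The exact commands aren't preserved in the transcript; an era-appropriate sketch of the workflow (assuming the unofficial decuda tool is on the PATH) would be:

```shell
# Compile device code to a cubin file (CUDA 2.x-era nvcc)
nvcc --cubin foo.cu      # produces foo.cubin
# Disassemble the embedded machine code with decuda
decuda foo.cubin
```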
CUDA provides fast (but imperfect) trigonometry in hardware!
The compiler is very aggressive in optimization. It will group memory loads together to minimize latency
Notice: each thread reads 20 floats!
(snippet from LQCD)