GPU Computing (Transcript)
Parallel Computing on GPUs
Christian Kehl, 01.01.2011
Overview
• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
Basics of Parallel Computing
Ref.: René Fink, „Untersuchungen zur Parallelverarbeitung mit wissenschaftlich-technischen Berechnungsumgebungen" ("Investigations into parallel processing with scientific-technical computing environments"), dissertation, University of Rostock, 2007
Brief History of SIMD vs. MIMD Architectures
• 2004 – programmable GPU core via shader technology
• 2007 – CUDA (Compute Unified Device Architecture) Release 1.0
• December 2008 – first Open Compute Language specification
• March 2009 – uniform shaders, first beta releases of OpenCL
• August 2009 – release and implementation of OpenCL 1.0
• SIMD technologies in GPUs:
  – vector processing (ILLIAC IV)
  – mathematical operation units (ILLIAC IV)
  – pipelining (CRAY-1)
  – local memory caching (CRAY-1)
  – atomic instructions (CRAY-1)
  – synchronized instruction execution and memory access (MASPAR)
OpenCL – Platform Model
• One host plus one or more compute devices
• Each compute device is composed of one or more compute units
• Each compute unit is further divided into one or more processing elements
OpenCL – Kernel Execution
• Total number of work-items = Gx × Gy
• Size of each work-group = Sx × Sy
• The global ID can be computed from the work-group ID and the local ID
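For the 2D case above, the index arithmetic can be sketched in plain C. The names (`global_id`, `total_work_items`) and the parameter order are illustrative, not part of the OpenCL API:

```c
#include <assert.h>

/* OpenCL NDRange indexing, 2D case:
   total work-items = Gx * Gy, work-group size = Sx * Sy.
   A work-item's global ID is derived from its work-group ID
   (wx, wy) and its local ID (lx, ly) within that group. */
typedef struct { int x; int y; } id2d;

id2d global_id(int wx, int wy, int lx, int ly, int Sx, int Sy) {
    id2d g = { wx * Sx + lx, wy * Sy + ly };
    return g;
}

int total_work_items(int Gx, int Gy) {
    return Gx * Gy;
}
```

In OpenCL C itself the same values come from the built-ins `get_global_id`, `get_group_id`, `get_local_id` and `get_local_size`.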
OpenCL – Memory Management
OpenCL – Memory Model
• Address spaces:
  – Private – private to a work-item
  – Local – local to a work-group
  – Global – accessible by all work-items in all work-groups
  – Constant – read-only global space
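A device-code sketch of how the four qualifiers appear in OpenCL C; the kernel name and arguments here are hypothetical:

```c
/* Hypothetical OpenCL C kernel illustrating the four address spaces. */
__kernel void scale(__global float* data,     /* global: visible to all work-items */
                    __constant float* coeff,  /* constant: read-only global space  */
                    __local float* tile)      /* local: shared within a work-group */
{
    int gid = get_global_id(0);
    float tmp = data[gid];                    /* tmp resides in private space      */
    tile[get_local_id(0)] = tmp;              /* stage into local memory           */
    barrier(CLK_LOCAL_MEM_FENCE);             /* sync the work-group               */
    data[gid] = tmp * coeff[0];
}
```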
![Page 17: GPU Computing](https://reader038.vdocuments.net/reader038/viewer/2022102815/554f8da9b4c905d25b8b5003/html5/thumbnails/17.jpg)
17
OpenCL – Programming Language
• Every GPU computing technology is natively programmed in C/C++ (host)
• Host-code bindings to several other languages exist (Fortran, Java, C#, Ruby)
• Device code is written exclusively in standard C plus extensions
OpenCL – Language Restrictions
• Pointers to functions are not allowed
• Pointers to pointers are allowed within a kernel, but not as a kernel argument
• Bit-fields are not supported
• Variable-length arrays and structures are not supported
• Recursion is not supported
• Writes to pointers of types smaller than 32 bit are not supported
• Double types are not supported, but reserved
• 3D image writes are not supported
• Some restrictions are addressed through extensions
Common Application Domain
• Multimedia data and tasks are best suited for SIMD processing
• Multimedia data: sequential byte streams, each byte independent
• Image processing is particularly well suited for GPUs
• The original GPU task: "compute <several FLOP> for every pixel of the screen" (computer graphics)
• The same task applies to images; only the FLOPs differ
Common Application Domain – Image Processing
• Possible features realizable on the GPU:
  – contrast and luminance adjustment
  – gamma scaling
  – (pixel-by-pixel) histogram scaling
  – convolution filtering
  – edge highlighting
  – negative image / image inversion
  – …
Image Processing – Inversion
• Simple example: inversion
• Implementation and use of a framework for switching between different GPGPU technologies
• Creation of a command queue for each GPU
• Reading the GPU kernel from a kernel file on the fly
• Creation of buffers for the input and output image
• Memory copy of the input image data to global GPU memory
• Setting of the kernel arguments and kernel execution
• Memory copy of the GPU output buffer data to a new image
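The per-pixel operation behind this pipeline is trivially data-parallel. A plain-C sketch of the inversion itself; the OpenCL kernel performs the same computation with one work-item per pixel, and 8-bit grayscale is assumed here purely for illustration:

```c
#include <stddef.h>

/* Invert an 8-bit image: each output pixel is 255 minus the input pixel.
   On the GPU, one work-item computes exactly one of these iterations. */
void invert_image(const unsigned char* in, unsigned char* out, size_t n_pixels) {
    for (size_t i = 0; i < n_pixels; ++i)
        out[i] = (unsigned char)(255 - in[i]);
}
```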
Evaluated and confirmed minimum speed-up, G80 GPU with OpenCL vs. 8-core CPU with OpenMP: 4 : 1
Monte Carlo (MC) Study of a Spring-Mass System (SMS) using OpenCL and OpenMP
• Task
• Modelling
• Euler as a simple ODE solver
• Existing MIMD solutions
• An SIMD approach
• OpenMP
• Result plots
• Speed-up study
• Parallelization conclusions
• Résumé
Task
• Spring-mass system defined by a differential equation
• The behavior of the system must be simulated over varying damping values
• Therefore: numerical solution in t, t ∈ [0.0, 2.0] s, for a step size h = 1/1000
• Analysis of computation time and speed-up for different compute architectures
Based on Simulation News Europe (SNE) CP2:
• 1000 simulation iterations over the simulation horizon with generated damping values (Monte Carlo study)
• consecutive averaging for s(t)
• t ∈ [0, 2] s; h = 0.01 → 200 steps
On present architectures this is too lightweight → modification:
• 5000 iterations with Monte Carlo
• h = 0.001 → 2000 steps
Aim of the analysis: knowledge about the spring behavior for different damping values (trajectory array)
• Simple spring-mass system
  – d … damping constant
  – c … spring constant
• Movement equation derived from Newton's second axiom
• Modelling needed → free-body diagram of the mass ("Massenfreischnitt")
  – the mass is moved
  – force-balance equation
Modelling
• Numerical integration based on the 2nd-order differential equation
• A DE of 2nd order is rewritten as two DEs of 1st order (in general: a DE of order n → n DEs of 1st order)
• Newton's 2nd axiom (force balance): F_T + F_D + F_C = 0
  m·s''(t) + d·s'(t) + c·s(t) = 0
  ⇒ s''(t) = −(d/m)·s'(t) − (c/m)·s(t)
• with s(t) … position, s'(t) = v(t) … velocity, s''(t) = a(t) … acceleration
• Transformation by substitution: s1(t) = s(t), s2(t) = s'(t) = s1'(t)
  ⇒ s1'(t) = s2(t)
    s2'(t) = s''(t) = −(d/m)·s2(t) − (c/m)·s1(t)
• Given by SNE CP2:
  c = 9000, m = 450 kg
  t_start = 0 s, t_end = 2 s
  start values: s(0) = 0 m, s'(0) = v(0) = 0.1 m/s
• Random damping parameter d within the interval limits [800; 1200]; 5000 iterations
Euler as simple ODE solver
• Trajectory s(t) sought → ODE system with start values t0, s0 and s0'
• Numerical integration by the explicit Euler method:
  s(t1) = s(t0 + h) = s(t0) + h·f(t0, s0)
  s(t2) = s(t1) + h·f(t1, s1)
  s(t3) = s(t2) + h·f(t2, s2)
  …
• Use for the spring-mass problem – iterate over all steps:
  s1' = s2(t),  s2' = −(d/m)·s2(t) − (c/m)·s1(t)
  s1(t + h) = s1(t) + h·s1'
  s2(t + h) = s2(t) + h·s2'
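A minimal C sketch of this scheme for one damping value; parameter values follow the modified CP2 task, and the function name is illustrative:

```c
/* Explicit Euler for the spring-mass system
   s1' = s2,  s2' = -(d/m)*s2 - (c/m)*s1
   with c = 9000 and m = 450 kg as given by SNE CP2. */
void euler_spring_mass(double d, double h, int steps, double* s1, double* s2) {
    const double c = 9000.0, m = 450.0;
    for (int i = 0; i < steps; ++i) {
        double ds1 = *s2;                                /* s1' = s2             */
        double ds2 = -(d / m) * (*s2) - (c / m) * (*s1); /* damped restoring force */
        *s1 += h * ds1;                                  /* s1(t+h) = s1 + h*s1' */
        *s2 += h * ds2;                                  /* s2(t+h) = s2 + h*s2' */
    }
}
```

Calling this 5000 times with sampled d values, h = 0.001 and 2000 steps reproduces the inner loop of the Monte Carlo study.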
Existing MIMD Solutions
• The approach cannot be applied to GPU architectures
• MIMD requirements:
  – each PE has its own instruction flow
  – each PE can access RAM individually
• GPU architecture → SIMD:
  – each PE computes the same instruction at the same time
  – each PE has to be at the same instruction for accessing RAM
Therefore: development of an SIMD approach
An SIMD Approach
• S.P./R.F.:
  – simultaneous execution of the sequential simulation with varying d-parameter on spatially distributed PEs
  – averaging depends on trajectories
• C.K.:
  – simultaneous computation with all d-parameters for time t_n, iterative repetition until t_end
  – averaging depends on steps
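Both orderings visit the same (d, t) grid and produce the same averaged trajectory; only the loop that is parallelized differs. A toy C sketch, in which the sizes and the `step()` function are placeholders rather than the real Euler update:

```c
#define N_D 4   /* damping samples (illustrative) */
#define N_T 3   /* time steps (illustrative)      */

/* step() stands in for one Euler update; here a toy function. */
static double step(double s, double d) { return s + 0.001 * d; }

/* MIMD ordering (S.P./R.F.): each PE runs a full trajectory for one d. */
void mc_mimd(const double* d, double* avg) {
    for (int t = 0; t < N_T; ++t) avg[t] = 0.0;
    for (int i = 0; i < N_D; ++i) {              /* parallel over PEs      */
        double s = 0.0;
        for (int t = 0; t < N_T; ++t) {
            s = step(s, d[i]);
            avg[t] += s / N_D;
        }
    }
}

/* SIMD ordering (C.K.): each time step processes all d values at once. */
void mc_simd(const double* d, double* avg) {
    double s[N_D] = {0};
    for (int t = 0; t < N_T; ++t) {              /* sequential over time   */
        avg[t] = 0.0;
        for (int i = 0; i < N_D; ++i) {          /* parallel over lanes    */
            s[i] = step(s[i], d[i]);
            avg[t] += s[i] / N_D;
        }
    }
}
```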
OpenMP
• Parallelization technology based on the shared-memory principle
• Synchronization hidden from the developer
• Thread management controllable
• On System-V-based OSes: parallelization by process forking
• On Windows-based OSes: parallelization by WinThread creation (AMD study / Intel tech paper)
• In C/C++: pragma-based preprocessor directives
• In C#: represented by parallel loops
• More than just parallelizing loops (AMD tech report)
• Literature:
  – AMD/Intel tech papers
  – Thomas Rauber, „Parallele Programmierung"
  – Barbara Chapman, "Using OpenMP: Portable Shared Memory Parallel Programming"
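A minimal example of such a directive, sketched here as a parallel reduction; if compiled without OpenMP the pragma is simply ignored and the loop runs sequentially with the same result:

```c
/* OpenMP parallelizes the loop across threads; the reduction clause
   gives each thread a private partial sum and combines them at the end. */
double parallel_sum(const double* v, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}
```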
Result Plot
(Plot: resulting trajectory for all technologies)
Speed-Up Study
| # Cores | MIMD Single | MIMD OpenMP | SIMD Single | SIMD OpenMP | SIMD OpenCL |
|---------|-------------|-------------|-------------|-------------|-------------|
| 1       | 1.0 (T=56.53) | 1.0 | 0.9 (T=64.63) | 0.9 | 0.4 (T=144.6) |
| 2       | X | 1.8 | X | 1.4 | X |
| 4       | X | 3.5 | X | 2.0 | X |
| 8       | X | 5.7 | X | 1.7 | X |
| 16      | X | 5.1 | X | 0.5 | X |
| dyn/std | 1.0 | 5.7 | 0.9 | 1.7 | 0.4 |

• OpenMP: own study, comparison CPU/GPU
• SIMD Single: presented SIMD approach on the CPU
• SIMD OpenMP: presented SIMD approach parallelized on the CPU
• SIMD OpenCL: control of the number of executing units is not possible, therefore only one value
(Plot: speed-up of SIMD OpenCL, SIMD single, MIMD single, SIMD OpenMP and MIMD OpenMP)
Parallelization Conclusions
• The problem is unsuited for SIMD parallelization
• On-GPU reduction is too time-expensive; therefore:
  – Euler computation on the GPU
  – average computation on the CPU
• Most time-intensive operation: memory copy between GPU and main memory
• For more complex problems or a different ODE solver procedure, the speed-up behavior can change
• The MIMD approach of S.P./R.F. is efficient for SNE CP2
• OpenMP realization is possible (and done) for both the MIMD and the SIMD approach
• The OpenMP MIMD realization achieves almost linear speed-up
• Setting more threads than physically available PEs leads to significant thread overhead
• With dynamic assignment, OpenMP automatically matches the number of threads to the physically available PEs
Résumé
• The task can be solved on both CPUs and GPUs
• GPU computing requires new approaches and porting of algorithms
• Although GPUs have a massive number of parallel operating cores, a speed-up is not possible for every application domain
• Advantages of GPU computing:
  – very fast and scalable for suited problems (e.g. multimedia)
  – cheap HPC technology in comparison to scientific supercomputers
  – energy-efficient
  – massive computing power in a small size
• Disadvantages of GPU computing:
  – limited instruction set
  – strictly SIMD
  – SIMD algorithm development is hard
  – no execution supervision (e.g. segmentation/page fault)