Personal Super Computing Competence Center
General introduction: GPUs and the realm of parallel architectures
GPU Computing Training
August 17-19th 2015
Jan Lemeire ([email protected])
• Graduated as Engineer in 1994 at VUB
• Worked for 4 years for 2 IT-consultancy companies
• 2000-2007: PhD at the VUB while teaching as assistant
• Subject: probabilistic models for the performance analysis of parallel programs
• Since 2008: postdoc and part-time professor at VUB, Department of Electronics and Informatics (ETRO)
• Teaching ‘Informatics’ for first-year bachelors; ‘parallel systems’ and ‘advanced computer architecture’ to masters
• Since 2012: also teaching engineering students in industrial sciences (‘industrial engineers’)
• Projects, papers, PhD students in parallel processing (performance analysis, GPU computing) & data mining/machine learning (probabilistic models, causality, learning algorithms)
• http://parallel.vub.ac.be
A bit of History
The first computer, mechanical
Charles Babbage
1791-1871
Babbage Difference Engine made with LEGO
• http://acarol.woz.org/
This machine can evaluate polynomials of the form Ax² + Bx + C for x = 0, 1, 2, …, n with 3-digit results.
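A difference engine tabulates a polynomial with additions only, using the method of finite differences. The following is a minimal Python sketch of that idea for a quadratic; the function name `tabulate_quadratic` and the example coefficients are illustrative assumptions and are not taken from the LEGO machine.

```python
def tabulate_quadratic(a, b, c, n):
    """Return [p(0), ..., p(n)] for p(x) = a*x^2 + b*x + c.

    After the initial setup, every new value is obtained by additions only,
    which is the principle behind a difference engine.
    """
    value = c              # p(0)
    first_diff = a + b     # p(1) - p(0)
    second_diff = 2 * a    # the second difference of a quadratic is constant
    table = []
    for _ in range(n + 1):
        table.append(value)
        value += first_diff        # move to the next p(x)
        first_diff += second_diff  # update the first difference
    return table


if __name__ == "__main__":
    # Hypothetical coefficients, chosen only for illustration.
    print(tabulate_quadratic(2, 3, 5, 5))
```

The loop body contains two additions per tabulated value, which is why the scheme suits a purely mechanical adder.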
The first informatician
• Describes what software is
• She brings the insight that a computer goes beyond plain calculations
• She writes the first algorithm/program
Ada Lovelace
1815 – 1852
ENIAC
First computer: WWII
John Mauchly and John Eckert, 1945
Programming = rewiring,
changing the hardware
Parallel computer!
Von Neumann rethinks the computer
John Von Neumann
The Von Neumann-architecture
Execute program step-by-step
For acceleration: pipeline
• Long operations
• Combination of short operations
• Pipelining
[Figure: timelines of instructions 1 to 4: as four long operations, as sequences of four short operations each, and pipelined through the stages IF, ID, OF, EX]
Pipeline Design
• Typically five tasks in instruction execution
• IF: instruction fetch
• ID: instruction decode
• OF: operand fetch
• EX: instruction execution
• OS: operand store, often called write-back (WB)
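The gain from pipelining can be estimated with a small calculation. The sketch below assumes an idealized pipeline in which every stage takes exactly one cycle and no stalls or hazards occur; the function names are illustrative.

```python
def sequential_cycles(n_instructions, n_stages):
    """Cycles needed when each instruction finishes all stages before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """Cycles needed in an ideal pipeline: fill the pipeline once, then one result per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def schedule(n_instructions, stages):
    """Return {stage: [instruction number or None per cycle]} for an ideal pipeline."""
    total = pipelined_cycles(n_instructions, len(stages))
    table = {stage: [None] * total for stage in stages}
    for instr in range(n_instructions):
        for depth, stage in enumerate(stages):
            table[stage][instr + depth] = instr + 1  # instruction i enters stage s at cycle i + s
    return table


if __name__ == "__main__":
    four_stages = ["IF", "ID", "OF", "EX"]
    print(sequential_cycles(4, 4), pipelined_cycles(4, 4))
    for stage, row in schedule(4, four_stages).items():
        print(stage, " ".join("." if x is None else str(x) for x in row))
```

Under these assumptions, four instructions on a four-stage pipeline need 7 cycles instead of 16, and for long instruction streams the speed-up approaches the number of stages.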
Superscalar out-of-order pipeline
3 instructions simultaneously (= pipeline width)
Independent instructions can overtake each other
‘Sequential’ processor: super-scalar out-of-order pipeline
Pipeline depth
Pipeline width
Different processing units
Out-of-order execution
Branch prediction
Register renaming
…
Now we are computing sequentially!
Parallel computing?
Super computer: BlueGene/L
• IBM 2007
• 65,536 dual-core nodes
• E.g. one processor dedicated to communication, the other to computation
• Each with 512 MB RAM
• No. 8 in the Top 500 Supercomputer list (2010)
• www.top500.org
Clusters
• Made from commodity parts
• or blade servers
• Open-source software available
Distributed-Memory Architectures
• Each process has its own local memory
• Communication through messages
• Process is in control
[Figure: several CPUs, each with its own local memory M, connected by a network N]
BlueGene/L communication networks
(a) 3D torus (64x32x32) for standard interprocessor data transfer
• Cut-through routing (see later)
(b) Collective network for fast evaluation of reductions
(c) Barrier network by a common wire
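To make the torus topology concrete, here is a small Python sketch of neighbour lookup and hop distance in a 64x32x32 torus. The coordinate scheme and function names are assumptions made for illustration; they do not describe the actual BlueGene/L addressing or routing hardware.

```python
import math

DIMS = (64, 32, 32)  # torus dimensions taken from the slide


def node_count(dims=DIMS):
    """Total number of nodes in the torus."""
    return math.prod(dims)


def torus_neighbors(coord, dims=DIMS):
    """Return the six direct neighbours of a node; edges wrap around in each dimension."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            other = list(coord)
            other[axis] = (other[axis] + step) % size  # wrap-around link
            neighbors.append(tuple(other))
    return neighbors


def torus_hops(a, b, dims=DIMS):
    """Minimum number of links between two nodes, going either way around each ring."""
    hops = 0
    for x, y, size in zip(a, b, dims):
        direct = abs(x - y)
        hops += min(direct, size - direct)
    return hops


if __name__ == "__main__":
    print(node_count())
    print(torus_neighbors((0, 0, 0)))
    print(torus_hops((0, 0, 0), (63, 0, 0)))
```

The product 64 x 32 x 32 equals 65,536, and the wrap-around links mean that the node at coordinate 63 is only one hop away from the node at coordinate 0 in the first dimension.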
Message-passing
• The ability to send and receive messages is all we need
• void send(message, destination)
• message receive(source)
• boolean probe(source)
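The three primitives can be mimicked inside a single Python program. In the sketch below, threads stand in for processes and one queue per (source, destination) pair stands in for the network. The classes `MessagePassingSystem` and `Endpoint` and the function `parallel_sum` are hypothetical names introduced for illustration; this is not the API of MPI or any other library.

```python
import queue
import threading


class MessagePassingSystem:
    """Hypothetical stand-in for a network: one mailbox per (source, destination) pair."""

    def __init__(self, n_processes):
        self.mailboxes = {
            (src, dst): queue.Queue()
            for src in range(n_processes)
            for dst in range(n_processes)
        }


class Endpoint:
    """The view one process (identified by its rank) has of the system."""

    def __init__(self, system, rank):
        self.system = system
        self.rank = rank

    def send(self, message, destination):
        self.system.mailboxes[(self.rank, destination)].put(message)

    def receive(self, source):
        # Blocks until a message from `source` is available.
        return self.system.mailboxes[(source, self.rank)].get()

    def probe(self, source):
        # True if a message from `source` is waiting.
        return not self.system.mailboxes[(source, self.rank)].empty()


def parallel_sum(data, n_workers):
    """Rank 0 scatters chunks to the workers and gathers their partial sums."""
    system = MessagePassingSystem(n_workers + 1)

    def worker(rank):
        me = Endpoint(system, rank)
        chunk = me.receive(source=0)
        me.send(sum(chunk), destination=0)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(1, n_workers + 1)]
    for t in threads:
        t.start()

    master = Endpoint(system, 0)
    for r in range(1, n_workers + 1):
        master.send(data[r - 1::n_workers], destination=r)  # every n-th element
    total = sum(master.receive(source=r) for r in range(1, n_workers + 1))
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(parallel_sum(list(range(100)), 4))
```

No worker touches another worker's data directly; all coordination happens through `send` and `receive`, which is the defining property of the message-passing model.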
Multicore computing
Shared Address-space Architectures
• Example: multiprocessors
[Figure: several CPUs sharing one memory M]
AMD Barcelona: 4 processor cores
Thread / core
• A different thread per core
• Each thread can run independently
Multi-threaded programming
• Thread synchronization necessary
• Multiple threads per core also possible
• Context switches necessary
• Hardware threads/hyperthreading
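Why thread synchronization is necessary can be shown with a shared counter. The following Python sketch protects the update with a lock; the function name is illustrative. Note that CPython threads do not execute Python bytecode on several cores at the same time, so this only illustrates the synchronization pattern, not a multicore speed-up.

```python
import threading


def count_with_lock(n_threads, increments_per_thread):
    """Let several threads increment one shared counter, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def work():
        nonlocal counter
        for _ in range(increments_per_thread):
            with lock:        # only one thread at a time may update the counter
                counter += 1  # read-modify-write on shared memory

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()              # wait until every thread has finished
    return counter


if __name__ == "__main__":
    print(count_with_lock(4, 10_000))
```

Without the lock, two threads could read the same old value and both write back the same new value, so an increment could be lost.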
Latency Hiding
[Figure: two routes to speed-up, labelled Faster CPU and More threads]
4 cores: x4. Latency hiding: x3.
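A simple utilization model shows how extra threads per core hide latency. The sketch assumes that each thread alternates between a fixed number of compute cycles and a fixed number of waiting cycles (for example on memory), and that switching between threads is free. The cycle counts in the example are hypothetical values chosen to reproduce the factors on the slide.

```python
def core_utilization(n_threads, compute_cycles, wait_cycles):
    """Fraction of time the core does useful work when n_threads share it."""
    return min(1.0, n_threads * compute_cycles / (compute_cycles + wait_cycles))


def latency_hiding_speedup(n_threads, compute_cycles, wait_cycles):
    """Throughput gain of n_threads on one core compared with a single thread."""
    single = core_utilization(1, compute_cycles, wait_cycles)
    return core_utilization(n_threads, compute_cycles, wait_cycles) / single


if __name__ == "__main__":
    # Assumed workload: 1 compute cycle followed by 2 waiting cycles.
    per_core = latency_hiding_speedup(3, compute_cycles=1, wait_cycles=2)
    cores = 4
    print(per_core, cores * per_core)
```

In this model, adding threads beyond the point where the core is fully busy brings no further gain, so the two factors multiply only up to that limit.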
Next level
GPU vs CPU Peak Performance Trends
GPU peak performance has grown aggressively.
Hardware has kept up with Moore’s law
Source: NVIDIA
• 1995: 5,000 triangles/second, 800,000 transistors per GPU
• 2010: 350 million triangles/second, 3 billion transistors per GPU
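The two data points can be turned into doubling times with a short calculation. The sketch assumes steady exponential growth between 1995 and 2010; the function name is illustrative.

```python
import math


def doubling_time_years(start_value, end_value, years):
    """Years needed to double, assuming steady exponential growth."""
    return years / math.log2(end_value / start_value)


YEARS = 2010 - 1995

transistor_doubling = doubling_time_years(800_000, 3_000_000_000, YEARS)
triangle_doubling = doubling_time_years(5_000, 350_000_000, YEARS)

if __name__ == "__main__":
    print(f"transistors double every {transistor_doubling:.2f} years")
    print(f"triangle rate doubles every {triangle_doubling:.2f} years")
```

With these figures the transistor count doubles roughly every 1.3 years and the triangle rate roughly every 0.9 years, so performance grew faster than the transistor budget alone would suggest.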
Parallel processors
Courtesy of
GPUs (arrays)
GPU Architecture
streaming multiprocessor
global memory
1 Streaming Multiprocessor
Hardware threads are grouped into warps of 32 threads that execute in lockstep (SIMD)
In each cycle, one warp is selected and its next instruction is put into the pipeline
The Same Instruction is executed on Multiple Data (SIMD)
Width of pipeline: 8 to 32
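The warp mechanism can be imitated with a toy simulator. In the sketch below, threads are grouped into warps of 32, one warp is selected per cycle, and its next instruction is applied to all of its threads. The round-robin selection and the representation of instructions as Python functions are simplifying assumptions; a real scheduler chooses among the warps that are ready to issue.

```python
WARP_SIZE = 32


def make_warps(n_threads, warp_size=WARP_SIZE):
    """Group consecutive thread ids into warps; the last warp may be partly filled."""
    return [
        list(range(start, min(start + warp_size, n_threads)))
        for start in range(0, n_threads, warp_size)
    ]


def run_kernel(instructions, data, warp_size=WARP_SIZE):
    """Apply a list of elementwise instructions to data, one warp per cycle.

    Returns the final values and a log of (cycle, warp index, instruction index).
    """
    values = list(data)
    warps = make_warps(len(values), warp_size)
    if not warps:
        return values, []
    program_counter = [0] * len(warps)
    issued = []
    cycle = 0
    while min(program_counter) < len(instructions):
        w = cycle % len(warps)                 # assumed round-robin warp selection
        pc = program_counter[w]
        if pc < len(instructions):
            operation = instructions[pc]
            for thread_id in warps[w]:         # same instruction for every thread in the warp
                values[thread_id] = operation(values[thread_id])
            issued.append((cycle, w, pc))
            program_counter[w] += 1
        cycle += 1
    return values, issued


if __name__ == "__main__":
    program = [lambda x: x * 2, lambda x: x + 1]
    result, log = run_kernel(program, range(70))
    print(len(make_warps(70)), log)
```

The inner loop over the threads of a warp runs sequentially here; on a GPU those lanes would be processed together by the SIMD pipeline.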
Why are GPUs faster?
Devote transistors to… computation
GPU processor pipeline
• ±24 stages (old), now ±8
• in-order execution!!
• no branch prediction!!
• no forwarding!!
• no register renaming!!
• Memory system:
• relatively small
• Until recently no caching
• On the other hand: many more registers (see later)
Multi-Threading (MT) possibilities
Context switch
Processing power not for free
Obstacle 1: Hard(er) to implement
Obstacle 2: Hard(er) to get efficiency
CPU computing
[Figure: flow from Algorithm through Implementation and Compiler to Automatic optimization, with steps marked manual or automatic]
Low latency of each instruction!
Write once, run everywhere efficiently!
Computer science is not about computers
Application
• Identification of compute-intensive parts
• Feasibility study of GPU acceleration
• GPU implementation
• GPU optimization
Hardware
Programmability solutions
• Libraries
• Skeleton-based
• OpenCL
• Pragma-based
• Automatic parallelization
Computer science is not about computers
Challenges of GPU computing
[Figure: Algorithms, Implementation, Compiler, Optimization]
Performance, programmability, portability
Intel Xeon Phi
Single Instruction Multiple Data (SIMD)
• Operate elementwise on vectors of data
• E.g., MMX and SSE instructions in x86
• Multiple data elements in 128-bit wide registers
• Data has to be moved explicitly to/from vector registers
• All processors execute the same instruction at the same time
• Instruction has to be fetched only once
Instructions can be performed at once on all elements of vector registers
Example with 128-bit vector registers: (7, 8, 2, -1) + (3, -3, 5, -7) = (10, 5, 7, -8)
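The vector addition from the slide can be reproduced in a few lines of Python. The sketch models a 128-bit register as 16 bytes holding four 32-bit integers; that lane layout is an assumption made for illustration, since such registers can also be split into lanes of other widths. The helper names are illustrative, and Python adds the lanes one after another where real SIMD hardware adds them in a single instruction.

```python
import struct

LANE_FORMAT = "<4i"  # four little-endian 32-bit integers = 128 bits


def pack_register(values):
    """Pack four integers into a 16-byte value that stands in for a vector register."""
    return struct.pack(LANE_FORMAT, *values)


def unpack_register(register):
    """Return the four lanes of a packed register as a tuple."""
    return struct.unpack(LANE_FORMAT, register)


def simd_add(reg_a, reg_b):
    """Elementwise addition of two packed registers (models one vector instruction)."""
    lanes_a = unpack_register(reg_a)
    lanes_b = unpack_register(reg_b)
    return pack_register([x + y for x, y in zip(lanes_a, lanes_b)])


if __name__ == "__main__":
    a = pack_register((7, 8, 2, -1))
    b = pack_register((3, -3, 5, -7))
    print(len(a) * 8, unpack_register(simd_add(a, b)))
```

The explicit packing and unpacking mirror the point that data has to be moved to and from vector registers before and after the vector instruction.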
Vector processors (SIMD)
• Highly pipelined function units
• Stream data from/to vector registers to units
• Data collected from memory into registers
• Results stored from registers to memory
• Have long been viewed as the solution for high-performance computing
• Why keep repeating the same instruction (on different data)? => Just apply the instruction to all data at once
• However: difficult to program
• Is SIMT (OpenCL) a better alternative?
Intel’s Xeon Phi coprocessor
Intel’s response to GPUs…
60 cores
RAM
ring network
Intel’s Xeon Phi core
• 4 hardware threads
• 512-bit vector unit (SIMD)
• Thread scheduler
Vectorization needed for peak performance!!
__m512 register
_mm512_add_* operations (e.g. _mm512_add_ps)
Automatic vectorization only works in simple cases
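What vectorizing a loop means can be sketched without intrinsics. The Python code below processes two arrays in chunks of 16 elements, the number of 32-bit values that fit in a 512-bit register, and handles the leftover elements in a scalar loop. The function `vector_add_16` is a hypothetical stand-in for a single 512-bit add instruction; the code models the loop structure only and gives no speed-up in Python.

```python
LANES = 16  # 512 bits / 32 bits per element


def vector_add_16(a_chunk, b_chunk):
    """Hypothetical stand-in for one 512-bit vector add on 16 lanes."""
    assert len(a_chunk) == len(b_chunk) == LANES
    return [x + y for x, y in zip(a_chunk, b_chunk)]


def add_arrays(a, b):
    """Add two equally long arrays: full chunks of 16 first, then a scalar remainder loop."""
    assert len(a) == len(b)
    result = []
    full = len(a) - len(a) % LANES
    for i in range(0, full, LANES):        # vectorized part: one "instruction" per chunk
        result.extend(vector_add_16(a[i:i + LANES], b[i:i + LANES]))
    for i in range(full, len(a)):          # scalar part for the leftover elements
        result.append(a[i] + b[i])
    return result


def instruction_counts(n_elements):
    """Return (vector instructions, scalar instructions) needed for n_elements."""
    return n_elements // LANES, n_elements % LANES


if __name__ == "__main__":
    a = list(range(40))
    b = [2 * x for x in a]
    print(add_arrays(a, b)[:5], instruction_counts(40))
```

A loop of this shape, with independent iterations, is the simple case; loops whose iterations depend on each other do not split into chunks this easily.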
Conclusions
The third pillar of the scientific world
Parallel Programming Paradigms
Paradigms: message-passing (MPI), explicit multi-threading, OpenMP, OpenCL/CUDA, explicit vector instructions
Architectures: distributed memory, shared memory, SIMD, SIMT
Granularity: coarse-grain parallelism, fine-grain parallelism
[Figure: block diagrams of processors (PROC, P) and memories (M), connected by a network (N) in the distributed-memory case]
Hardware? Software?
• You need to have insight into the hardware!
• No universal hardware/programming model (yet)
• Intel approach (SIMD)
• Intel sticks to the x86 architecture
• That’s what programmers know & they won’t change
• Vectorization necessary
• OpenCL approach (SIMT)
• Will the semi-abstract model remain valid?