![Page 1: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/1.jpg)
UC Regents Spring 2014 © UCB
CS 152 L22: GPU + SIMD + Vectors
2014-4-15
John Lazzaro (not a prof - “John” is always OK)
CS 152 Computer Architecture and Engineering
www-inst.eecs.berkeley.edu/~cs152/
TA: Eric Love
Lecture 22 -- GPU + SIMD + Vectors I
![Page 2: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/2.jpg)
Today: Architecture for data parallelism
The Landscape: Three chips that deliver TeraOps/s in 2014, and how they differ.
GK110: nVidia’s flagship Kepler GPU, customized for compute applications.
Short Break
E5-2600v2: Stretching the Xeon server approach for compute-intensive apps.
![Page 3: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/3.jpg)
Sony/IBM Playstation PS3 Cell Chip - Released 2006
![Page 4: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/4.jpg)
Sony PS3 Cell Processor SPE Floating-Point
32-bit 32-bit 32-bit 32-bit
Single-Instruction Multiple-Data
4 single-precision multiply-adds issue in lockstep (SIMD) per cycle. 6 cycle latency (in blue).
6 gamer SPEs, 3.2 GHz clock --> 150 GigaOps/s
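The 150 GigaOps/s figure follows if each multiply-add is counted as two operations. A minimal Python sketch of that arithmetic, assuming peak issue on every lane every cycle (the function name `peak_gigaops` and its parameters are illustrative, not from the lecture):

```python
def peak_gigaops(units, lanes, ops_per_lane, clock_ghz):
    """Peak throughput in GigaOps/s: every lane of every unit busy on every cycle."""
    return units * lanes * ops_per_lane * clock_ghz

# 6 SPEs, 4 single-precision lanes each, a multiply-add counted as 2 ops, 3.2 GHz clock.
cell = peak_gigaops(units=6, lanes=4, ops_per_lane=2, clock_ghz=3.2)
print(f"Cell SPEs: {cell:.1f} GigaOps/s")  # the slide rounds this to 150
```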
![Page 5: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/5.jpg)
Sony PS3 Cell Processor SPE Floating-Point
32-bit 32-bit 32-bit 32-bit
Single-Instruction Multiple-Data
In the 1970s a big part of a computer architecture class would be learning how to build units like this.
Top-down (f.p. format) && Bottom-up (logic design)
![Page 6: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/6.jpg)
Sony PS3 Cell Processor SPE Floating-Point
The PS3 ceded ground to Xbox not because it was underpowered, but because it was hard to program.
Today, the formats are standards (IEEE f.p.) and the bottom-up is now “EE.”
Architects focus on how to organize floating point units into programmable machines for application domains.
![Page 7: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/7.jpg)
2014: TeraOps/Sec Chips
![Page 8: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/8.jpg)
Intel E5-2600v2
12-core Xeon Ivy Bridge
0.52 TeraOps/s
12 cores @ 2.7 GHz
Each core can issue 16 single-precision operations per cycle.
$2,600 per chip
Haswell: 1.04 TeraOps/s
![Page 9: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/9.jpg)
EECS 150: Graphics Processors UC Regents Fall 2013 © UCB
nVidia GPU
Kepler GK 110
5.12 TeraOps/s
2880 MACs @ 889 MHz (single-precision multiply-adds)
$999 GTX Titan Black with 6GB GDDR5 (and 1 GPU)
![Page 10: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/10.jpg)
XC7VX980T
Xilinx Virtex 7 with the most DSP blocks.
3600 MACs @ 714 MHz
Comparable to single-precision floating-point.
5.14 TeraOps/s
$16,824 per chip
Typical application: Medical imaging scanners, for first stage of processing after the A/D converters.
(die photo of a related part)
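The three headline TeraOps/s numbers can be reproduced from the unit counts and clock rates on the slides. The sketch below is only that arithmetic, in Python; the `chips` table layout and the `tera_ops` helper are illustrative names, and the dollars-per-GigaOp/s column is a derived figure that does not appear in the lecture (it also mixes chip-only prices with a full card price).

```python
# (name, single-precision ops issued per cycle across the chip, clock in GHz, price in USD)
chips = [
    ("Xeon E5-2600v2 (Ivy Bridge)", 12 * 16, 2.7, 2600),     # 12 cores x 16 ops/cycle
    ("GK110 (GTX Titan Black)", 2880 * 2, 0.889, 999),       # 2880 MACs, 2 ops each
    ("Virtex 7 XC7VX980T", 3600 * 2, 0.714, 16824),          # 3600 MACs, 2 ops each
]

def tera_ops(ops_per_cycle, clock_ghz):
    """Peak TeraOps/s, assuming every unit issues on every cycle."""
    return ops_per_cycle * clock_ghz / 1000.0

for name, ops, ghz, usd in chips:
    peak = tera_ops(ops, ghz)
    print(f"{name:30s} {peak:5.2f} TeraOps/s   ${usd / (peak * 1000):5.2f} per GigaOp/s")
```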
![Page 11: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/11.jpg)
Intel E5-2600v2
12 cores @ 2.7 GHz
Each core can issue 16 single-precision ops/cycle.
How?
Haswell cores issue 32/cycle.
![Page 12: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/12.jpg)
Die closeup of one Sandy Bridge core
Advanced Vector Extensions (AVX) unit
Smaller than L3 cache, but larger than L2 cache.
Relative area has increased in Haswell.
![Page 13: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/13.jpg)
Programmers Model
AVX
IA-32 Nehalem
8 128-bit registers
Each register holds 4 IEEE single-precision floats
The programmers model has many variants, which we will introduce in the slides that follow.
![Page 14: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/14.jpg)
Example AVX Opcode
VMULPS XMM4, XMM2, XMM3
XMM4 = XMM2 op XMM3, where op = *
Multiply two 4-element vectors of single-precision floats, element by element.
New issue every cycle. 5 cycle latency (Haswell).
Aside from its use of a special register set, VMULPS executes like normal IA-32 instructions.
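To make the element-by-element semantics concrete, here is a small Python model of the operation. It is a sketch of what the opcode computes, not of how the hardware executes it; the helper names `to_f32` and `vmulps` are made up for this example, and single precision is emulated by round-tripping through `struct`.

```python
import struct

def to_f32(x):
    """Round a Python float (a double) to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def vmulps(src1, src2):
    """Element-by-element product of two 4-element single-precision vectors."""
    assert len(src1) == len(src2) == 4
    return [to_f32(to_f32(a) * to_f32(b)) for a, b in zip(src1, src2)]

xmm2 = [1.0, 2.0, 3.0, 4.0]
xmm3 = [0.5, 0.25, 2.0, -1.0]
xmm4 = vmulps(xmm2, xmm3)   # models VMULPS XMM4, XMM2, XMM3
print(xmm4)                 # [0.5, 0.5, 6.0, -4.0]
```

All four products are independent of each other, which is what lets the hardware compute them in lockstep.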
![Page 15: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/15.jpg)
Sandy Bridge, Haswell
Sandy Bridge extends register set to 256 bits: vectors are twice the size.
x86-64 (Intel 64) AVX/AVX2 has 16 registers (IA-32: 8)
Haswell adds 3-operand instructions: a*b + c
Fused multiply-add (FMA)
2 EX units with FMA --> 2X increase in ops/cycle
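The per-core figures quoted earlier (16 ops/cycle for Ivy Bridge, 32 for Haswell) can be recovered from the vector width and the number of math units. A hedged Python sketch: the function and parameter names are illustrative, and the pre-Haswell case assumes one multiply unit and one add unit that can both issue every cycle.

```python
def ops_per_cycle(vector_bits, math_units, ops_per_element, element_bits=32):
    """Peak single-precision ops per cycle for one core."""
    lanes = vector_bits // element_bits          # 256-bit register -> 8 floats
    return lanes * math_units * ops_per_element

# Sandy Bridge / Ivy Bridge: 256-bit vectors, a multiply unit plus an add unit
# (assumption: both busy every cycle), each producing 1 op per element.
ivy_bridge = ops_per_cycle(256, math_units=2, ops_per_element=1)

# Haswell: 256-bit vectors, two FMA units, each doing a multiply and an add per element.
haswell = ops_per_cycle(256, math_units=2, ops_per_element=2)

print(ivy_bridge, haswell)   # 16 32
```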
![Page 16: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/16.jpg)
OoO Issue
Haswell (2013)
Haswell sustains 4 micro-op issues per cycle. One possibility: 2 for AVX, and 2 for Loads, Stores and book-keeping.
Haswell has two copies of the FMA engine, on separate ports.
![Page 17: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/17.jpg)
AVX: Not just single-precision floating-point
AVX instruction variants interpret 128-bit registers as 4 floats, 2 doubles, 16 8-bit integers, etc ...
256-bit version -> double-precision vectors of length 4
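The register itself is just a bag of bits; the opcode decides how to slice it. The Python sketch below uses the standard `struct` module to view the same 16 bytes three ways. It illustrates the idea only: the variable names are invented here, and little-endian byte order is assumed.

```python
import struct

# 16 bytes stand in for one 128-bit register, filled with four single-precision floats.
raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
assert len(raw) * 8 == 128

as_floats = struct.unpack("<4f", raw)    # 4 x 32-bit floats
as_doubles = struct.unpack("<2d", raw)   # 2 x 64-bit doubles (same bits, different meaning)
as_int8 = struct.unpack("<16b", raw)     # 16 x 8-bit signed integers

print(as_floats)    # (1.0, 2.0, 3.0, 4.0)
print(as_int8[:4])  # the bytes of 1.0f: (0, 0, -128, 63)

# A 256-bit register holds exactly 4 doubles.
assert struct.calcsize("<4d") * 8 == 256
```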
![Page 18: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/18.jpg)
Exception Model
MXCSR: AVX condition codes register
Floating-point exceptions: Always a contentious issue in ISA design ...
![Page 19: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/19.jpg)
Exception Handling
Use MXCSR to configure AVX to halt program for divide by zero, etc ...
Or, configure AVX for “show must go on” semantics: on error, results are set to +Inf, -Inf, NaN, ...
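The two policies can be modelled in a few lines of Python. This is a behavioural sketch only: the function `vdiv` and its `trap_divide_by_zero` flag are hypothetical stand-ins for the MXCSR configuration, and the sketch leaves the 0/0 case untrapped (it simply yields NaN).

```python
import math

def vdiv(a, b, trap_divide_by_zero):
    """Element-wise a/b under a 'halt' policy or a 'show must go on' policy."""
    out = []
    for x, y in zip(a, b):
        if y != 0.0:
            out.append(x / y)
        elif x == 0.0 or math.isnan(x):
            out.append(math.nan)                  # 0/0 or NaN/0: result is NaN
        else:
            if trap_divide_by_zero and math.isfinite(x):
                raise FloatingPointError("divide by zero")   # halt the program
            sign = math.copysign(1.0, x) * math.copysign(1.0, y)
            out.append(math.copysign(math.inf, sign))        # +Inf or -Inf
    return out

print(vdiv([1.0, -1.0, 0.0, 4.0], [0.0, 0.0, 0.0, 2.0], trap_divide_by_zero=False))
# [inf, -inf, nan, 2.0]
```

With `trap_divide_by_zero=True` the same call raises `FloatingPointError` at the first element instead of returning a vector.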
![Page 20: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/20.jpg)
Data moves
AVX register file reads pass through permute and shuffle networks in both “X” and “Y” dimensions.
Many AVX instructions rely on this feature ...
![Page 21: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/21.jpg)
Pure data move opcode.
Or, part of a math opcode.
![Page 22: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/22.jpg)
Permutes over 2 sets of 4 fields of one vector.
Arbitrary data alignment
Shuffling two vectors.
![Page 23: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/23.jpg)
Memory System
Gather: Reading non-unit-stride memory locations into arbitrary positions in an AVX register, while minimizing redundant reads.
Values in memory.
Specified indices.
Final result.
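In software terms, a gather is an indexed load: result[k] = memory[indices[k]]. The Python sketch below also counts distinct locations touched, as a stand-in for "minimizing redundant reads"; the function name `gather` and the read-counting scheme are illustrative assumptions, not a description of the hardware.

```python
def gather(memory, indices):
    """Return (result, reads): result[k] = memory[indices[k]], one read per distinct index."""
    fetched = {}
    result = []
    for i in indices:
        if i not in fetched:
            fetched[i] = memory[i]     # only the first use of an index touches memory
        result.append(fetched[i])
    return result, len(fetched)

memory = [10.0 * k for k in range(16)]         # values in memory
indices = [12, 3, 3, 0, 7, 12, 1, 9]           # specified indices (non-unit stride)
result, reads = gather(memory, indices)
print(result)   # [120.0, 30.0, 30.0, 0.0, 70.0, 120.0, 10.0, 90.0]
print(reads)    # 6 distinct locations for 8 result fields
```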
![Page 24: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/24.jpg)
Positive observations ...
Best for applications that are a good fit for Xeon’s memory system: Large on-chip caches, up-to-a-TeraByte of DRAM, but only moderate bandwidth requirements to DRAM.
Applications that do “a lot of everything” -- integer, random-access loads/stores, string ops -- gain access to a significant fraction of a TeraOp/s of floating point, with no context switching.
If you’re planning on experimenting with GPUs, you need a Xeon server anyway ... aside from $$$, why not buy a high-core-count variant?
![Page 25: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/25.jpg)
Negative observations ...
AVX changes each generation, in a backward compatible way, to add the latest features.
AVX is difficult for compilers. Ideally, someone has written a library of hand-crafted AVX assembly code that does exactly what you want.
Two FMA units per core (50% of issue width) is probably the limit. So, scaling vector size or scaling core count are the only upgrade paths.
0.52 TeraOp/s (Ivy Bridge) << 5.12 TeraOp/s (GK110)
And $2700 (chip only) >> $999 (Titan Black card).
59.6 GB/s << 336 GB/s (memory bandwidth)
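To put numbers on the "<<" and ">>" comparisons, the Python sketch below divides the GK110 figures by the Ivy Bridge figures, using only the values quoted on this slide. The dictionary keys are illustrative names.

```python
xeon = {"tera_ops": 0.52, "usd": 2700, "gb_per_s": 59.6}    # Ivy Bridge, chip only
gk110 = {"tera_ops": 5.12, "usd": 999, "gb_per_s": 336}     # Titan Black card

# Ratio of GK110 to Xeon for each metric.
ratios = {key: gk110[key] / xeon[key] for key in xeon}

for key, value in ratios.items():
    print(f"{key:10s} GK110 / Xeon = {value:.2f}")
```

The bandwidth ratio comes out near 5.6 and the peak-throughput ratio near 10, at a little over a third of the price.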
![Page 26: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/26.jpg)
Break
![Page 27: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/27.jpg)
nVidia GPU
The granularity of SMX cores (15 per die) matches the Xeon core count (12 per die)
Kepler GK 110
![Page 28: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/28.jpg)
SMX core (28 nm)
Sandy Bridge core (32 nm)
![Page 29: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/29.jpg)
889 MHz GK 110 SMX core vs 2.7 GHz Haswell core
1024-bit SIMD vectors: 4X more than Haswell. 32 single-precision floats or 16 double-precision floats.
Execution units: 6 single-precision, 2 double-precision, special ops, memory ops.
Execution units vs. Haswell: 3X (single-precision), 1X (double-precision)
Clock speed vs Ivy Bridge Xeon: 3X slower
Net per core: 4X single-precision, 1.33X double-precision
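The final 4X and 1.33X figures are the product of the three ratios on this slide: vector width, number of execution units, and clock speed. A Python sketch of that multiplication follows; the variable names are illustrative, the Haswell side assumes 256-bit vectors and two FMA units per core, and the clock ratio uses the slide's rounded "3X" rather than the exact 2.7 / 0.889.

```python
vector_width_ratio = 1024 / 256      # SMX 1024-bit vs Haswell 256-bit: 4X
single_unit_ratio = 6 / 2            # 6 single-precision units vs 2 FMA units: 3X
double_unit_ratio = 2 / 2            # 2 double-precision units vs 2 FMA units: 1X
clock_slowdown = 3.0                 # slide's rounding of 2.7 GHz / 889 MHz

net_single = vector_width_ratio * single_unit_ratio / clock_slowdown
net_double = vector_width_ratio * double_unit_ratio / clock_slowdown

print(f"single-precision: {net_single:.2f}X")   # 4.00X
print(f"double-precision: {net_double:.2f}X")   # 1.33X
```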
![Page 30: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/30.jpg)
CS 152 L14: Cache Design and Coherency UC Regents Spring 2014 © UCB
Organization: Multi-threaded like Niagara
Thread scheduler
2048 registers in total. Several programmer models available. Largest model has 256 registers per thread, supporting 8 active threads.
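The trade-off between registers per thread and the number of active threads is a simple division of the register file. In the Python sketch below, only the 256-register case comes from the slide; the 128 and 64 register budgets are hypothetical examples added to show the trend, and `active_threads` is an illustrative name.

```python
TOTAL_REGISTERS = 2048   # register file size quoted on the slide

def active_threads(registers_per_thread):
    """How many threads fit if each one is given this many registers."""
    return TOTAL_REGISTERS // registers_per_thread

# 256 is the slide's largest model; 128 and 64 are hypothetical smaller budgets.
for regs in (256, 128, 64):
    print(f"{regs:3d} registers/thread -> {active_threads(regs):2d} active threads")
```

Fewer registers per thread leaves room for more threads, which gives the thread scheduler more candidates to hide latency with.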
![Page 31: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/31.jpg)
Organization: Multi-threaded, In-order
Thread scheduler
The SIMD math units live here
Each cycle, 3 threads can issue 2 in-order instructions.
![Page 32: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/32.jpg)
Bandwidth to DRAM is 5.6X Xeon Ivy Bridge.
But, DRAM limited to 6GB, and all caches are small compared to Xeon.
![Page 33: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/33.jpg)
nVidia GPU
Kepler GK 110
5.12 TeraOps/s
2880 MACs @ 889 MHz (single-precision multiply-adds)
$999 GTX Titan Black with 6GB GDDR5 (and 1 GPU)
![Page 34: UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014-4-15 John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d985503460f94a82a17/html5/thumbnails/34.jpg)
On Thursday
To be continued ...
Have fun in section!