A Quick HPC Perspective on Accelerators
Vincent Betro Computational Scientist
University of Tennessee National Institute for Computational Sciences
Copyright 2013
Things started looking harder around 2004…
Source: published SPECInt data
Moore’s Law is not at all dead…
Intel process technology capabilities (Source: Intel)

High-volume manufacturing year                 | 2004 | 2006 | 2008 | 2010 | 2012 | 2014 | 2016 | 2018
Feature size                                   | 90nm | 65nm | 45nm | 32nm | 22nm | 16nm | 11nm | 8nm
Integration capacity (billions of transistors) | 2    | 4    | 8    | 16   | 32   | 64   | 128  | 256

[Image: a transistor from the 90nm process, roughly 50nm across, shown at the scale of an influenza virus. Sources: Intel; CDC.]
The problem in 2004…didn’t get better.
[Figure: power density (watts per square cm), 1975-2020, for historical single-core and multi-core processors and the ITRS high-performance projection, with reference lines for a 100 W lightbulb, a hot plate, and a nuclear reactor.]
[Figure: clock rate (MHz), log scale, 1992-2016, for node processor clocks and accelerator core clocks, each with high, low, and median series.]
Source: Kogge and Shalf, IEEE CISE. Courtesy Horst Simon, LBNL.
Not a new problem, just a new scale…
[Figure: CPU power (W) over time; photo of a Cray-2 with its cooling tower in the foreground, circa 1985.]
How do we get the same number of transistors to give us more performance without cranking up the power?
[Diagram: one big core with its cache versus four small cores (C1-C4) sharing a cache, with bar charts of power and performance. Rule of thumb from the slide: a small core draws about 1/4 the power of a big core while delivering about 1/2 of its single-thread performance.]
Many core is more power efficient
• Power ~ area
• Single-thread performance ~ area^0.5

Example: dual core with voltage scaling
  Single core: Area = 1, Voltage = 1, Freq = 1 → Power = 1, Perf = 1
  Dual core:   Area = 2, Voltage = 0.85, Freq = 0.85 → Power = 1, Perf ≈ 1.8

Rule of thumb: a 15% reduction in voltage yields a 15% frequency reduction, a 45% power reduction, and a 10% performance reduction (see the worked numbers below).
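A quick check of those numbers (a minimal worked example; the dynamic-power relation below is the standard approximation and is not stated on the slide):

\[
P \propto C\,V^{2}f, \qquad (0.85)^{2} \times 0.85 \approx 0.61,
\]

so cutting voltage and frequency by 15% each reduces a core's power by roughly 40-45%. Applying the per-core rule of thumb to two cores:

\[
P_{\text{dual}} \approx 2 \times 0.55 \approx 1.1 \approx 1, \qquad
\text{Perf}_{\text{dual}} \approx 2 \times 0.90 = 1.8,
\]

which matches the slide's "Power = 1, Perf ≈ 1.8" for the dual-core case.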
GPU Architecture – Kepler: Streaming Multiprocessor (SMX)
• 192 SP CUDA cores per SMX: 192 fp32 ops/clock and 192 int32 ops/clock
• 64 DP CUDA cores per SMX: 64 fp64 ops/clock
• 4 warp schedulers; up to 2,048 threads resident concurrently
• 32 special-function units
• 64 KB shared memory + L1 cache, 48 KB read-only data cache, 64K 32-bit registers
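To make those numbers concrete, here is a minimal launch sketch (not from the slides; the kernel name, array size, and block size are illustrative assumptions) showing how a 256-thread block maps onto warps and SMX residency:

```
// Minimal sketch: how a CUDA launch configuration maps onto a Kepler SMX.
// Kernel, array size, and block size are illustrative choices.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, float a, int n)
{
    // Each thread handles one element; threads are issued in warps of 32.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= a;
}

int main()
{
    const int n = 1 << 20;                            // 1M elements
    float *d_data = NULL;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // 256 threads per block = 8 warps per block.
    // With up to 2,048 resident threads per SMX, up to
    // 2048 / 256 = 8 such blocks can be resident on one SMX at once.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    printf("launched %d blocks of %d threads\n", blocks, threadsPerBlock);
    cudaFree(d_data);
    return 0;
}
```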
Intel’s MIC Approach
Since the days of RISC vs. CISC, Intel has mastered the art of figuring out what is important about a new processing technology and asking, "Why can't we do this in x86?" The Intel Many Integrated Core (MIC) architecture is about a large die, simpler circuits, and much more parallelism, in the x86 line.
Courtesy Dan Stanzione, TACC
MIC Architecture
• Many cores on the die
• L1 and L2 cache
• Bidirectional ring network connecting the L2 caches, memory, and the PCIe connection
Courtesy Dan Stanzione, TACC
What makes an Accelerator?
[Diagram: a CPU with its own memory connected over the PCI bus to an accelerator with its own memory; the accelerator may be an NVIDIA GPU or an Intel Xeon Phi.]
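In the GPU case, that separation shows up directly in the programming model: data must be staged across the PCI bus explicitly. A minimal sketch (buffer names and sizes are illustrative; the Xeon Phi uses different offload mechanisms):

```
// Minimal sketch of the host/device split in the diagram above:
// CPU and GPU memories are separate, so data crosses the PCI bus explicitly.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void addOne(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] += 1.0f;
}

int main()
{
    const int n = 1024;
    std::vector<float> h_x(n, 41.0f);              // lives in CPU memory

    float *d_x = NULL;
    cudaMalloc((void **)&d_x, n * sizeof(float));  // lives in GPU memory

    // Host -> device copy: crosses the PCI bus.
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    addOne<<<(n + 255) / 256, 256>>>(d_x, n);      // compute runs on the GPU

    // Device -> host copy: crosses the PCI bus again.
    cudaMemcpy(h_x.data(), d_x, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h_x[0] = %.1f\n", h_x[0]);             // expect 42.0
    cudaFree(d_x);
    return 0;
}
```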
Top 10 Systems in November 2012 (Just got updated last month)
#  | Site                                                | Manufacturer | Computer                                                        | Country | Cores     | Rmax [Pflops] | Power [MW]
1  | Oak Ridge National Laboratory                       | Cray         | Titan: Cray XK7, Opteron 16C 2.2GHz, Gemini, NVIDIA K20x        | USA     | 560,640   | 17.6          | 8.21
2  | Lawrence Livermore National Laboratory              | IBM          | Sequoia: BlueGene/Q, Power BQC 16C 1.6GHz, Custom               | USA     | 1,572,864 | 16.3          | 7.89
3  | RIKEN Advanced Institute for Computational Science  | Fujitsu      | K Computer: SPARC64 VIIIfx 2.0GHz, Tofu interconnect            | Japan   | 795,024   | 10.5          | 12.66
4  | Argonne National Laboratory                         | IBM          | Mira: BlueGene/Q, Power BQC 16C 1.6GHz, Custom                  | USA     | 786,432   | 8.16          | 3.95
5  | Forschungszentrum Juelich (FZJ)                     | IBM          | JuQUEEN: BlueGene/Q, Power BQC 16C 1.6GHz, Custom               | Germany | 393,216   | 4.14          | 1.97
6  | Leibniz Rechenzentrum                               | IBM          | SuperMUC: iDataPlex DX360M4, Xeon E5 8C 2.7GHz, Infiniband FDR  | Germany | 147,456   | 2.90          | 3.52
7  | Texas Advanced Computing Center/UT                  | Dell         | Stampede: PowerEdge C8220, Xeon E5 8C 2.7GHz, Intel Xeon Phi    | USA     | 204,900   | 2.66          | —
8  | National SuperComputer Center in Tianjin            | NUDT         | Tianhe-1A: NUDT TH MPP, Xeon 6C, NVIDIA, FT-1000 8C             | China   | 186,368   | 2.57          | 4.04
9  | CINECA                                              | IBM          | Fermi: BlueGene/Q, Power BQC 16C 1.6GHz, Custom                 | Italy   | 163,840   | 1.73          | 0.82
10 | IBM                                                 | IBM          | DARPA Trial Subset: Power 775, Power7 8C 3.84GHz, Custom        | USA     | 63,360    | 1.52          | 3.57
Titan System at Oak Ridge National Laboratory (just displaced by a Chinese system using Xeon Phi)
• Upgrade of Jaguar from Cray XT5 to XK6
• Cray Linux Environment operating system
• Gemini interconnect: 3-D torus, globally addressable memory, advanced synchronization features
• AMD Opteron 6274 (Interlagos) processors
• New accelerated node design using NVIDIA multi-core accelerators: 2011: 960 NVIDIA X2090 "Fermi" GPUs; 2012: 14,592 NVIDIA "Kepler" GPUs
• 20+ PFlops peak system performance
• 600 TB DDR3 memory + 88 TB GDDR5 memory
Power Efficiency over Time
[Figure: Linpack performance per power (Gflops/kW), 2008-2012, for the TOP10, TOP50, and full TOP500, with accelerator-based and BlueGene systems plotted separately from conventional multicore systems.]
Data from: TOP500 November 2012
Courtesy Horst Simon, LBNL
Projected Performance Development
Data from: TOP500 November 2012
[Figure: projected TOP500 performance development, 1994-2020, on a log scale from 100 Mflop/s to 1 Eflop/s, showing the Sum, N=1, and N=500 trend lines.]
Courtesy Horst Simon, LBNL
Trends with ends.
[Figure: compound annual growth rate (CAGR), 1996-2012, for Rmax (Gflop/s), total cores, average clock rate per core (MHz), and memory per core (GB).]
Source: Kogge and Shalf, IEEE CISE. Courtesy Horst Simon, LBNL.
Summary
Heterogeneous computing is an important piece of the computing future. GPU accelerators are here to stay. Now, let’s see how you can use them…