TRANSCRIPT
Supercomputer Performance Characterization
Presented By: IQxplorer
Here are some important computer performance questions
• What key computer system parameters determine performance?
• What synthetic benchmarks can be used to characterize these system parameters?
• How does performance on synthetics compare between computers?
• How does performance on applications compare between computers?
• How does performance scale (i.e., vary with processor count)?
Comparative performance results have been obtained on six computers at NCSA & SDSC,
all with > 1,000 processors
Computer    Site   System   Compute processors:   Clock speed   Processor   Switch type
                   vendor   p/node                (GHz)         type
----------------------------------------------------------------------------------------
Blue Gene   SDSC   IBM      2,048: 2              0.7           PowerPC     Custom: 3-D torus + tree
Cobalt      NCSA   SGI      1,024: 512            1.6           Itanium 2   NUMAlink 4 + InfiniBand 4x
DataStar    SDSC   IBM      1,408: 8 & 768: 8     1.5 & 1.7     Power4+     Federation: fat tree
Mercury     NCSA   IBM      1,262: 2              1.5           Itanium 2   Myrinet 2000
Tungsten    NCSA   Dell     2,560: 2              3.2           Xeon        Myrinet 2000
T2          NCSA   Dell     1,024: 2              3.6           EM64T       InfiniBand 4x
These computers have shared-memory nodes of widely varying size connected by different switch types
• Blue Gene
  • Massively parallel processor system with low-power, 2p nodes
  • Two custom switches for point-to-point and collective communication
• Cobalt
  • Cluster of two large, 512p nodes (also called a constellation)
  • Custom switch within nodes & commodity switch between nodes
• DataStar
  • Cluster of 8p nodes
  • Custom high-performance switch called Federation
• Mercury, Tungsten, & T2
  • Clusters of 2p nodes
  • Commodity switches
Performance can be better understood with a simple model
• Total run time can be split into three components (a measurement sketch follows this list):
  t_tot = t_comp + t_comm + t_io
• Overlap may exist. If so, it can be handled as follows:
  • t_comp = computation time
  • t_comm = communication time that can't be overlapped with t_comp
  • t_io = I/O time that can't be overlapped with t_comp & t_comm
• Relative values vary depending upon computer, application, problem, & number of processors
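As an illustration (an addition to the transcript, not the study's actual harness), here is a minimal MPI sketch of how the three components can be timed. The daxpy-style kernel, the Allreduce, and the file write are stand-in workloads:

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000
    static double a[N], b[N];

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t_comp = 0, t_comm = 0, t_io = 0, t0;
        double s = 3.0, local = 0, global = 0;

        for (int step = 0; step < 10; step++) {
            /* computation: stand-in daxpy-style kernel */
            t0 = MPI_Wtime();
            for (int i = 0; i < N; i++) { a[i] += s * b[i]; local += a[i]; }
            t_comp += MPI_Wtime() - t0;

            /* communication: collective reduction across all processors */
            t0 = MPI_Wtime();
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            t_comm += MPI_Wtime() - t0;
        }

        /* I/O: rank 0 writes a result to disk */
        t0 = MPI_Wtime();
        if (rank == 0) {
            FILE *f = fopen("out.txt", "w");
            if (f) { fprintf(f, "%g\n", global); fclose(f); }
        }
        t_io += MPI_Wtime() - t0;

        if (rank == 0)
            printf("t_tot = %.3f s = t_comp %.3f + t_comm %.3f + t_io %.3f\n",
                   t_comp + t_comm + t_io, t_comp, t_comm, t_io);
        MPI_Finalize();
        return 0;
    }

Because the phases are serialized here, the overlap caveat above does not arise; in real codes with nonblocking communication or asynchronous I/O, t_comm & t_io count only the time that cannot be hidden behind computation.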
Run-time components depend upon system parameters & code features
Run-time    System parameters        Code features
component
----------------------------------------------------------------
t_comp      Flop speed               Computation in cache
            Memory bandwidth         Strided memory access
            Memory latency           Random memory access
t_comm      Interconnect bandwidth   Large message transfers
            Interconnect latency     Small message transfers
t_io        I/O rate                 Large transfers to disk
Differences between point-to-point & collective communication are important too
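To make that distinction concrete, here is a minimal MPI sketch (an illustrative addition, not from the slides). Blue Gene even provides separate networks for the two styles: the 3-D torus for point-to-point traffic & the tree for collectives.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double msg = 42.0;

        /* point-to-point: one sender, one receiver */
        if (size > 1) {
            if (rank == 0)
                MPI_Send(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(&msg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        /* collective: all ranks participate at once */
        MPI_Bcast(&msg, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }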
Compute, communication, & I/O speeds have been measured for many synthetic & application benchmarks
• Synthetic benchmarks
  • sloops (includes daxpy & dot)
  • HPL (Linpack)
  • HPC Challenge
  • NAS Parallel Benchmarks
  • IOR
• Application benchmarks
  • Amber 9 PMEMD (biophysics: molecular dynamics)
  • …
  • WRF (atmospheric science: weather prediction)
Normalized memory access profiles for daxpy show better memory access, but more memory contention,
on Blue Gene compared with DataStar
[Figure: memory bandwidth (memops/processor-clock) vs memory accessed (16n B), both on log scales. Curves: DS 1p/n -O4 qnoipa; DS 8p/n -O4 qnoipa; BG CO 1p/n -O3 440d; BG VN 2p/n -O3 440d; BG CO 1p/n -O3 440]
daxpy: a(i) = a(i) + s*b(i)
DataStar: 1.5-GHz Power4+ processors; Blue Gene: 700-MHz PowerPC processors
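For reference, here is a minimal serial sketch (an illustrative addition, not the actual sloops code) of a daxpy scan like the one behind this profile: the working set of 16n bytes (two double arrays of length n) is swept from cache sizes into main memory. Dividing the measured rate by the clock frequency would give the figure's memops/processor-clock units.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        for (long n = 1024; n <= (1L << 24); n *= 4) {
            double *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b);
            if (!a || !b) return 1;
            for (long i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

            long reps = (1L << 26) / n;     /* equal total work per size */
            double s = 3.0;
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long r = 0; r < reps; r++)
                for (long i = 0; i < n; i++)
                    a[i] = a[i] + s * b[i]; /* the daxpy kernel */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
            /* 3 memops per iteration: load a(i), load b(i), store a(i) */
            printf("16n = %9ld B: %6.1f Mmemops/s (check %g)\n",
                   16 * n, 3.0 * n * reps / sec / 1e6, a[n / 2]);
            free(a); free(b);
        }
        return 0;
    }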
Each HPCC synthetic benchmark measures one or two system parameters in varying combinations
Primary system parameters that determine performance
HPCC           Flop    Memory      Memory    Interconnect   Interconnect
benchmark      speed   bandwidth   latency   bandwidth      latency
-------------------------------------------------------------------------
HPL             x                             x
DGEMM           x
STREAM                  x
PTRANS                  x                     x
RandomAccess                        x                        x
FFTE                    x                     x
bench_lat_bw                                  x              x
Relative speeds are shown for HPCC benchmarks on 6 computers at 1,024p;
4 different computers are fastest depending upon benchmark; 2 of these are also slowest, depending upon benchmark
[Figure: speed relative to 1.5-GHz DataStar (log scale) for each HPCC benchmark at 1,024p: G-HPL, G-PTRANS, G-FFTE, G-RandomAccess, EP-STREAM Triad, EP-DGEMM, RandomRing Bandwidth, & RandomRing Latency. Systems: DataStar (1.5-GHz Power4+), Cobalt (1.6-GHz Itanium 2), Mercury (1.5-GHz Itanium 2), T2 (3.6-GHz EM64T), Tungsten (3.2-GHz Xeon), & Blue Gene (0.7-GHz PowerPC)]
Data available soon at CIP Web site: www.ci-partnership.org
Absolute speeds are shown for HPCC & IOR benchmarks on SDSC computers;
TG processors are fastest, BG & DS interconnects are fastest, & all three computers have similar I/O rates
Computer         EP-DGEMM    EP-STREAM      Ping Pong          Ping Pong      IOR write     IOR read
                 (Gflop/s)   Triad (GB/s)   bandwidth (GB/s)   latency (µs)   rate (GB/s)   rate (GB/s)
--------------------------------------------------------------------------------------------------------
Blue Gene        2.2         0.87           0.16               4              3.4           2.7
DataStar         3.7         1.68           1.40               6              3.8           2.0
TeraGrid IA-64   5.6         1.90           0.23               11             4.2           3.1
Relative speeds are shown for 5 applications on 6 computers at various processor counts; Cobalt & DataStar are generally fastest
[Figure: speed relative to 1.5-GHz DataStar (log scale) for GAMESS lg: 384p; MILC lg: 1,024p; NAMD ApoA1: 512p; PARATEC lg: 256p; & WRF std: 256p. Systems: Cobalt (1.6-GHz Itanium 2), DataStar (1.5-GHz Power4+), Mercury (1.5-GHz Itanium 2), T2 (3.6-GHz EM64T), Tungsten (3.2-GHz Xeon), & Blue Gene (0.7-GHz PowerPC)]
Good scaling is essential to take advantage of high processor counts
• Two types of scaling are of interest (see the formulas after this list)
  • Strong: performance vs processor count (p) for fixed problem size
  • Weak: performance vs p for fixed work per processor
• There are several ways of plotting scaling
  • Run time (t) vs p
  • Speed (1/t) vs p
  • Speed/p vs p
• Scaling depends significantly on the computer, application, & problem
• Use log-log plots to preserve ratios when comparing computers
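For reference (an added note, not from the original slides), the quantities behind these plots can be written as follows, where t(p) is the run time on p processors and p_0 is the smallest count measured:

  Strong-scaling speedup:   S(p) = t(p_0) / t(p)
  Parallel efficiency:      E(p) = (p_0 / p) * S(p)

On the speed/p vs p plots that follow, ideal strong scaling (E = 1) is a horizontal line, superlinear speedup (E > 1) rises above it, and loss of scaling slopes downward.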
AWM 512^3 problem shows good strong scaling to 2,048p on Blue Gene & to 512p on DataStar, but not on TeraGrid cluster
Data from Yifeng Cui
[Figure: AWM 512^3 execution time without I/O; wall clock time per step (log scale) vs number of processors (64 to 2,048) for DS, BG, & TG, each with its ideal-scaling line]
MILC medium problem shows superlinear speedup on Cobalt, Mercury, & DataStar at small processor counts; strong scaling ends for DataStar & Blue Gene above 2,048p
[Figure: MILC medium; speed/processor relative to 1.5-GHz DataStar at 16p (log scale) vs processors (16 to 2,048). Systems: Cobalt (1.6-GHz Itanium 2), Mercury (1.5-GHz Itanium 2), DataStar (1.5-GHz Power4+), T2 (3.6-GHz EM64T), Tungsten (3.2-GHz Xeon), & Blue Gene (0.7-GHz PowerPC)]
NAMD ApoA1 problem scales best on DataStar & Blue Gene; Cobalt is fastest below 512p, but the same speed as DataStar at 512p
[Figure: NAMD ApoA1 with 92k atoms; speed/processor relative to 1.5-GHz DataStar at 16p (log scale) vs processors (16 to 2,048). Systems: Cobalt (1.6-GHz Itanium 2), DataStar (1.5-GHz Power4+), Mercury (1.5-GHz Itanium 2), Tungsten (3.2-GHz Xeon), & Blue Gene (0.7-GHz PowerPC)]
WRF standard problem scales best on DataStar; Cobalt is fastest below 512p, but the same speed as DataStar at 512p
[Figure: WRF standard; speed/processor relative to 1.5-GHz DataStar at 16p (log scale) vs processors (16 to 2,048). Systems: Cobalt (1.6-GHz Itanium 2), DataStar (1.5-GHz Power4+), Mercury (1.5-GHz Itanium 2), T2 (3.6-GHz EM64T), & Tungsten (3.2-GHz Xeon)]
Communication fraction generally grows with processor count in strong scaling scans, such as for the WRF standard problem on DataStar
[Figure: WRF standard on DataStar (1.5-GHz Power4+); communication fraction (0.0 to 1.0) vs processors (16 to 2,048)]
A more careful look at Blue Gene shows many pluses
+ Hardware is more reliable than for other high-end systems installed at SDSC in recent years
+ Compute times are extremely reproducible
+ Networks scale well
+ I/O performance with GPFS is good at high p
+ Price per peak flop/s is low
+ Power per flop/s is low
+ Footprint is small
But there are also some minuses
- Processors are relatively slow
  • Clock speed is 700 MHz
  • Compilers seldom use second FPU in each processor (though optimized libraries do)
- Applications must scale well to get high absolute performance
- Memory is only 512 MB/node, so some problems don't fit
  • Coprocessor mode can be used (with 1p/node), but this is inefficient
  • Some problems still don't fit even in coprocessor mode
- Cross-compiling complicates software development for complex codes
Major applications ported and being run on BG at SDSC span various disciplines
Code name             Discipline    Description                      Implementors
----------------------------------------------------------------------------------------------------
Amber 9 PMEMD         Biophysics    Molecular dynamics               Ross Walker (SDSC)
AWM                   Geophysics    3-D seismic wave propagation     Yifeng Cui (SDSC)
DNS (ESSL)            Engineering   Direct numerical simulation      Diego Donzis (Georgia Tech) &
                                    of 3-D turbulence                Dmitry Pekurovsky (SDSC)
DOT (FFTW)            Biophysics    Protein docking                  Susan Lindsey (SDSC) &
                                                                     Wayne Pfeiffer (SDSC)
MILC *                Physics       Quantum chromodynamics           Doug Toussaint (Arizona)
mpcugles              Engineering   3-D fluid dynamics               Giri Chukkapalli (SDSC)
NAMD 2.6b1 (FFTW) *   Biophysics    Molecular dynamics               Sameer Kumar (IBM)
Rosetta *             Biophysics    Protein folding                  Ross Walker (SDSC)
SPECFEM3D             Geophysics    3-D seismic wave propagation     Brian Savage (Carnegie Institution)
* Most heavily used
Speed of BG relative to DataStar varies around the clock speed ratio (0.47 = 0.7/1.5) for applications on ≥ 512p;
CO & VN modes perform similarly (per MPI p)
[Figure: speed relative to 1.5-GHz DataStar (log scale) for BG in CO mode & VN mode, with the clock speed ratio of 0.47 marked. Applications: Amber 9 PMEMD Cellulose: 768p; AWM 512^3 w/o I/O: 1,024p; DNS (ESSL) 1,024^3: 1,024p; DOT (FFTW) UDG/UGI 54k rots: 512p; MILC large: 1,024p; mpcugles forward prop w/o I/O: 512p; NAMD 2.6b1 (FFTW) ApoA1: 512p; SPECFEM3D Tonga-Fiji: 1,024p]
DNS scaling on BG is generally better than on DataStar, but shows unusual variation;
VN mode is somewhat slower than CO mode (per MPI p)
[Figure: DNS 1024^3; speed/processor relative to 1.5-GHz DataStar (log scale) vs MPI processors (16 to 2,048) for 1.5-GHz DataStar, Blue Gene CO mode, & Blue Gene VN mode]
Data from Dmitry Pekurovsky
If the number of allocated processors is considered, then VN mode is faster than CO mode,
and both modes show unusual variation
(in CO mode each 2p node runs one MPI process, so a run with n MPI processes allocates 2n processors)
[Figure: DNS 1024^3; speed/processor relative to 1.5-GHz DataStar (log scale) vs allocated processors (16 to 2,048) for 1.5-GHz DataStar, Blue Gene CO mode, & Blue Gene VN mode]
Data from Dmitry Pekurovsky
IOR weak scaling scans using GPFS-WAN show BG in VN mode achieves 3.4 GB/s for writes (~DS) & 2.7 GB/s for reads (>DS)
[Figure: I/O rate (MB/s, log scale from 10 to 10,000) vs MPI processors (1 to 2,048), showing CO peak, CO write, CO read, VN peak, VN write, & VN read]
Blue Gene in CO & VN mode using gpfs-wan (default mapping)
Noncollective read/write via IOR (256 or 128 MB/p): -a POSIX -e -t 1m -b 256m or -b 128m
Data from 3/7/06
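For context, a representative IOR invocation with the flags quoted above might look like the following; the process count, the target path, and the -w (write), -r (read), & -o (test file) flags are illustrative additions, not from the slides:

    mpirun -np 512 ./IOR -a POSIX -e -t 1m -b 256m -w -r -o /gpfs-wan/ior_testfile

Here -a POSIX selects the POSIX I/O interface, -e fsyncs after writes, -t sets the transfer size per call, & -b sets the block size written by each process (256 MB/p, matching the caption above).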
Blue Gene has more limited applicability than DataStar, but is a good choice if the application is right
+ Some applications run relatively fast & scale well
+ Turnaround is good with only a few users
+ Hardware is reliable & easy to maintain
- Other applications run relatively slowly and/or don't scale well
- Some typical problems need to run in CO mode to fit in memory
- Other typical problems won't fit at all