TRANSCRIPT
Scalability Considerations for Compute Intensive Applications on Clusters
Christian Tanasescu, Daniel Thomas
SGI Inc.

Agenda
– Application Segments
– HPC Computational Requirements
– Scalability and Application Profiles
– Standard Benchmarks vs. Applications
– Communication vs. Computation Ratio
– BandeLa - Profiling and Modeling Tool
– Platform Directions
– Conclusions

The Move to Technology Transparency
The application drives the platform.

[Chart: "sellable feature" (low to high) vs. time — 1985–1995: feeds and speeds; 1995–1999: feeds and speeds plus architecture; 2000–2005: feeds and speeds, architecture, plus application]

Application Segments
– CSM - Computational Structural Mechanics
– CFD - Computational Fluid Dynamics
– CCM - Computational Chemistry and Material Science
– BIO - Bioinformatics
– SPI - Seismic Processing and Interpretation
– RES - Reservoir Simulation
– CWO - Climate/Weather/Ocean Simulation

Sweet Spot Scalability per Application Segment

[Bar chart, log scale 1–512 processors: sweet-spot scalability ranges for the CSM, CFD, CCM, BIO, SPI, RES, and CWO segments; the ranges shown are 32–512p, 8–64p, 4–32p, 8–64p, 16–64p, 4–16p, and 1–32p]

HPC Resource Demands for Energy

Energy Segment       | Software                | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
Seismic Processing¹  | ProMAX, Omega, GeoDepth | H   | H         | H      | L–M²     | H       | ~4 to 500
Reservoir Simulation | VIP, Eclipse            | M   | H         | L      | H        | M       | ~100

¹ Seismic processing packages such as ProMAX are comprised of a large number of executables. The data in this row are for the subsets of executables that are most time-consuming.
² There are modules for which this entry would be H, but they comprise only about 10% of the total seismic processing workload.

HPC Resource Demands for CAE

MCAE Segment     | Software                    | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
IFEA Statics     | ABAQUS, ANSYS, MSC.Nastran  | H   | H         | M      | L        | L       | < 10p
IFEA Dynamics    | ABAQUS, ANSYS, MSC.Nastran  | L   | H         | H      | H        | L       | < 10p
EFEA             | LS-DYNA, PAM-CRASH, RADIOSS | H   | L         | L      | M        | M       | ~50p
CFD Unstructured | FLUENT, STAR-CD, PowerFLOW  | M   | H         | M      | H        | H       | ~100p
CFD Structured   | OVERFLOW                    | H   | H         | L      | M        | M       | ~100p

HPC Resource Demands for Bioinformatics

Bio Segment             | Software                             | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
Sequence Matching       | Blast, Fasta, Smith-W., HMMER, Wise2 | H   | M         | L      | L        | M       | 4-32
Sequence Matching (HTC) | HTC + sequence-matching code         | H   | H         | M      | L        | M       | ~100
Sequence Alignment      | ClustalW, Phylip                     | H   | M         | L      | M        | L       | 24
Sequence Assembly       | Phrap, Paracel                       | H   | M         | M      | M        | L       | 16

HPC Resource Demands for Computational Chemistry

Segment           | Software                      | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
QM "ab initio"    | Gaussian, Gamess, ADF, CASTEP | H   | H/M       | H/M    | L        | L       | 1-32
QM semi-empirical | Mopac, Ampac                  | H   | L         | L      | L        | M       | 1-4
MM/MD             | Amber, Charmm, NAMD           | H   | M         | M      | M        | H       | 1-64
Docking           | Dock, FleXx                   | H   | L         | L      | L        | L       | 1-64

QM: Quantum Mechanics; large variation in Memory BW, I/O BW, and scalability. MM/MD: Molecular Mechanics/Molecular Dynamics. Docking: scalability via throughput.

HPC Resource Demands for Weather and Climate Models

Segment                         | Software            | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
Explicit finite difference      | MM5                 | H   | M         | L      | L        | H       | ~100-500p
Semi-implicit finite difference | HIRLAM              | H   | M         | L      | H        | H       | ~100-500p
Spectral climate models         | CCM3/CAM            | H   | M         | L      | H        | M       | ~64-128p
Spectral weather models         | NOGAPS, IFS, ALADIN | H   | M         | L      | H        | M       | ~200p
Coupled climate models          | CCSM2, FMS          | H   | M         | L      | H        | H       | ~100p

Performance Dependency on Architecture
R12K @400MHz, 8MB L2, in Origin2000 and Origin3000

[Bar chart: Origin3000 @400MHz performance relative to Origin2000 @400MHz for Linpack, Specfp2000, STREAM, Abaqus/std, Nastran (103), Nastran (101), Star-CD, LS-Dyna, PAM-Crash, Radioss, Vectis, Fluent, CASTEP, Amber, and Gaussian, grouped by dominant bottleneck (CPU; CPU/cache; memory); ratios range from about 1.03 to 1.64]

The performance corridor is defined by Linpack (lower limit) and STREAM (upper limit).
• The relative performance improvement in applications is greater than the factor indicated by Specfp2000.
• Exceptions are the I/O-intensive applications such as Nastran-NVH or Gaussian (bandwidth steers performance).

Performance Dependency on Microprocessor Clock Rate - Same Architecture

[Bar chart: Origin3000 @600MHz performance relative to Origin3000 @500MHz for Linpack, Specfp2000, STREAM, Abaqus/std, Abaqus/Exp, Nastran (103, 111, 101, 108), Ansys, Star-CD (1p, 8p), LS-Dyna, PAM-Crash, Radioss, Madymo, Vectis (8p), Fluent (8p, 1p), Fire, and PowerFLOW (16p), grouped by dominant bottleneck (CPU; CPU/cache; memory); ratios range from about 1.04 to 1.2]

The performance corridor is defined by Linpack (upper limit) and STREAM (lower limit).

Performance Dependency on Microprocessor Cache Size - Same Architecture

[Bar chart: Origin300 @500MHz performance relative to Origin3000 @500MHz for the same benchmark and application set as above; ratios range from about 0.84 to 1.0]

The performance corridor is defined by Specfp2000 (lower limit) and STREAM (upper limit).

Key Applications: Instruction Mix

[Stacked bar chart, 0-100%: mix of floating-point operations, integer operations, memory access instructions, and branch instructions for Nastran, Ansys, PAM-Crash, LS-Dyna, Radioss (CSM); PowerFLOW, Fluent, Star-HPC, Fire (CFD); Gaussian, Gamess, Amber, CASTEP, ADF (CCM); BLAST, FASTA, ClustalW (BIO); MM5, HIRLAM, CCM3, IFS (CWO); ProMAX, Omega (SPI); Eclipse, VIP (RES)]

Instruction Mix
• Real applications have between 5% and 45% FP instructions, with an average of 22%, while the average share of memory access instructions is 39%.
• There are more integer than FP instructions; exceptions are BLAS-like solvers such as Nastran, Abaqus, and ProMAX.
• The ratio of graduated loads and stores to FP operations is 1.7x (worked through in the sketch below).
• Compute-intensive applications are also data-intensive applications.
• Vector systems had a system balance of 1 (one flop per byte).
• Next-generation architectures need to address the memory bandwidth issue.
  – I/O puts an additional burden on memory bandwidth.
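To make the loads/stores-to-flops arithmetic concrete, here is a minimal sketch (added for illustration, not from the talk) that computes the bytes-per-flop demand of a DAXPY-like kernel, the kind of BLAS-style operation these applications spend much of their time in:

    #include <stdio.h>

    /* DAXPY: y[i] = a*x[i] + y[i]
       Per iteration: 2 flops (one multiply, one add) and
       3 double accesses (load x, load y, store y) = 24 bytes. */
    int main(void) {
        double flops_per_iter = 2.0;
        double bytes_per_iter = 3.0 * sizeof(double);   /* 24 bytes */
        printf("DAXPY demand: %.0f bytes/flop\n",
               bytes_per_iter / flops_per_iter);        /* prints 12 */
        return 0;
    }

Compare that 12 bytes/flop demand with the vector-era balance of one flop per byte; cache-based systems supply far less memory bandwidth per flop, which is why data reuse in cache is decisive.
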
System Balance
• Supercomputing platforms must balance:
  – microprocessor power
  – memory size
  – bandwidth
  – latency
  – I/O balance is another important consideration
• Supercomputers after the Cray 1 began to lose balance.

[Chart, lower is better: balance vs. number of CPUs (1-64) for CRAY T90, NEC SX-5, IBM p690 @1.3GHz, SGI O3K @600MHz, HP Superdome @750MHz, HP GS320 @1GHz, and SGI Altix 3000 @1GHz]

Communication vs. Computation Ratio in Key Applications - Measured with BandeLa

[Stacked bar chart, 0-100%: computation, wait, MPI SW latency, and data transfer shares for Nastran/4, Ansys/2, PAM-Crash/32, LS-Dyna/48p, Radioss/96 (CSM); PowerFLOW/64, Fluent/64, Star-HPC, Fire/32 (CFD); Gaussian/16, Gamess/32, Amber/8, CASTEP/128, ADF/32 (CCM); BLAST/16, FASTA/16, ClustalW/16 (BIO); MM5, HIRLAM, CCM3/16, IFS (CWO); ProMAX, Omega (SPI); Eclipse/52, VIP/32 (RES)]

Communication Details
• Computation: the time spent outside MPI.
• Wait: the time a CPU is locked in mpi_wait, caused by:
  – load imbalance
  – contention of the traffic through the interconnect fabric or the switch
• MPI SW latency: the time accounted to the MPI library; sensitive to MPI latency.
• Data transfer: the time the transfer engine is active (bcopy on Origin3000 or Altix 3000); sensitive to MPI bandwidth.

An important inhibiting factor for scalability is load imbalance (WAIT). It needs to be addressed by future architectures and programming models; a simple way to quantify it is sketched below.
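As a hypothetical illustration (not BandeLa's own metric), a scalar summary of this WAIT-driven imbalance is the ratio of the slowest rank's compute time to the mean:

    #include <stdio.h>

    /* max/mean of per-rank compute times: 1.0 means perfectly balanced. */
    double imbalance(const double *t, int nranks) {
        double max = t[0], sum = 0.0;
        for (int i = 0; i < nranks; i++) {
            if (t[i] > max) max = t[i];
            sum += t[i];
        }
        return max / (sum / nranks);
    }

    int main(void) {
        double t[4] = { 9.0, 10.0, 11.0, 14.0 };   /* seconds per rank */
        printf("imbalance = %.2f\n", imbalance(t, 4));   /* 1.27 */
        return 0;
    }

The CCM3 study later in the talk quotes an imbalance ratio of 1.4 in exactly this spirit.
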
BandeLa Profiling Tool
An MPI tool to answer the question: what if the BANDwidth and LAtency change up or down?
1) Run the application with the targeted number of CPUs in order to capture the timings outside the MPI calls and the sequence of MPI "kernels" generated by the MPI library (isend, irecv, wait, test).
2) Replay the timings, applying a simple model to time the above kernels (see the replay sketch below).
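A minimal sketch of the replay step, assuming a trace of alternating compute and communication events; the event layout and the linear latency-plus-size/bandwidth model are illustrative simplifications, not BandeLa's actual implementation:

    #include <stdio.h>

    enum kind { COMPUTE, COMM };
    struct event { enum kind k; double compute_secs; double msg_bytes; };

    /* Advance one rank's clock through its recorded trace, re-timing
       each communication kernel under candidate (latency, bandwidth). */
    static double replay(const struct event *ev, int n,
                         double latency, double bandwidth) {
        double clock = 0.0;
        for (int i = 0; i < n; i++)
            clock += (ev[i].k == COMPUTE)
                   ? ev[i].compute_secs                      /* replayed as-is */
                   : latency + ev[i].msg_bytes / bandwidth;  /* re-modeled */
        return clock;
    }

    int main(void) {
        struct event trace[] = { { COMPUTE, 1.0, 0 }, { COMM, 0, 1e6 },
                                 { COMPUTE, 2.0, 0 }, { COMM, 0, 8e6 } };
        printf("predicted elapsed: %.3f s\n",
               replay(trace, 4, 2.25e-6, 250e6));  /* Origin3000-like values */
        return 0;
    }
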
BandeLa Profiling Tool
– Several topologies can be specified:
  • single host, clusters
– Several communication schemes:
  • Origin3000-like (the receiving CPU does the transfer)
  • synchronous and asynchronous transfers
  • interleaving (an arriving message shares the hardware immediately)
  • no interleaving (an arriving message waits for the previous messages to fully complete)

BandeLa - Basic Functionality
– The MPI library transforms any MPI function into a sequence of four kernels:
  • MPI_SGI_request_send ≡ mpi_isend
  • MPI_SGI_request_recv ≡ mpi_irecv
  • MPI_SGI_request_test ≡ mpi_test
  • MPI_SGI_request_wait ≡ mpi_wait
– The BandeLa instrumentation catches these sequences and records the computational time outside MPI. This is an application "signature" independent of the communication hardware (the general interception technique is sketched below).
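For readers unfamiliar with how such instrumentation hooks in, here is a sketch of the standard MPI profiling-interface technique: define the MPI entry point, record an event, and forward to the PMPI_ name. It is illustrative only; BandeLa intercepts SGI's internal MPI_SGI_request_* kernels rather than this portable layer.

    #include <stdio.h>
    #include <mpi.h>

    /* The wrapper library is loaded ahead of the MPI library; every
       MPI_Isend is logged (bytes, time spent inside MPI) and forwarded. */
    int MPI_Isend(const void *buf, int count, MPI_Datatype type, int dest,
                  int tag, MPI_Comm comm, MPI_Request *req) {
        double t0 = MPI_Wtime();
        int rc = PMPI_Isend(buf, count, type, dest, tag, comm, req);
        int size;
        PMPI_Type_size(type, &size);
        fprintf(stderr, "isend: %d bytes, %.2f us inside MPI\n",
                count * size, (MPI_Wtime() - t0) * 1e6);
        return rc;
    }

The gap between one kernel's exit timestamp and the next kernel's entry is the computation time recorded in the application signature.
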
BandeLa - Instrumentation Example
– No need to relink. Some environment variables can also be set in order to partially instrument the application without relinking:

    setenv LD_LIBRARY64_PATH .../ACQUIRE_64
    setenv RLD64_LIST libbandela.so:DEFAULT
    f77 -o test_bcast test_bcast.f -lmpi
    setenv MPI_BUFFER_MAX 2000
    mpirun -np 4 test_bcast

– One file is created for each process; four files are created for this example: fort.177, fort.178, fort.179, fort.180 (the starting number, 177, may be changed with an environment variable).

BandeLa - Parameters (Single Host)
• MPI latency:
  – the time accounted to the MPI software for doing its work (queuing messages, checking message arrivals, ...)
  – for the model, this is simply an amount of time added to the communication table of the particular CPU on entry of an "MPI kernel function"
  – 2.25 µs on Origin3000, or 4.5 µs for a full send-receive on Origin3000
• MPI bandwidth:
  – the speed at which bcopy does its job
  – 250 MB/s on average on the Origin3000
BandeLa - Validating the Model with Measurement (Origin3000 Single Host)
Model using the default bandwidth of 250 MB/s.

[Bar chart: CCM3 spectral climate model on 16 CPU Origin3000; elapsed time (0-600 s) per MPI rank 0-15, broken into computation, wait, MPI SW latency, and physical data transfer, compared against the measured communication time]

BandeLa - Validating the Model with Measurement (Origin3000 Single Host)
Model using a tuned bandwidth of 225 MB/s.

[Bar chart: CCM3 spectral climate model on 16 CPU Origin3000; elapsed time (0-600 s) per MPI rank 0-15, broken into computation, wait, MPI SW latency, and physical data transfer, compared against the measured communication time]

BandeLa - "What-If" Analysis
• Topologies available:
  – single shared-memory host (Origin3000 or Altix 3000)
  – switch (clusters)

BandeLa - Data Transfer Methods
• Transfer synchronously done by the receiving CPU (Origin3000 or Altix 3000 host)
• Transfer done synchronously or asynchronously, with or without interleaving, at constant bandwidth
• Transfer done asynchronously with interleaving, with bandwidth depending on request size (Myrinet)

BandeLa Myrinet Parameters?
• MPI SW latency: the Myrinet 2000 ping-pong latency on Origin300 has been measured at 17 µs.
• Bandwidth: as for the Origin300 single host, the workload may change the adapter performance, but:
  – the bandwidth also depends on the message size
  – the CPUs have to share the adapter(s) (this is considered in the model used in BandeLa)
  – the number of adapters used changes the performance of each adapter

BandeLa Myrinet Parameters?
The bandwidth chart given by Myricom vs. the bandwidth chart used by the model: the model depends only on the asymptotic bandwidth, which is not 250 MB/s under a real workload.

[Chart: modeled Myrinet bandwidth (0-300 MB/s) vs. message size (1 byte to 1,000,000 bytes)]

BandeLa Myrinet Parameters?
We use the following asymptotic values:
  – 1 adapter: 93 MB/s
  – 2 adapters: 85 MB/s
  – 4 adapters: 75 MB/s
These values were set from runs with two Origin300 systems linked with four adapters on both machines. We think these are the asymptotic bandwidths really seen by the applications.
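One simple way to realize a size-dependent bandwidth curve that approaches such asymptotes is the classic half-performance-length form sketched below. Only the 93/85/75 MB/s asymptotes come from the slide; the n_half value is an assumed illustrative figure, not the curve BandeLa actually uses.

    #include <stdio.h>

    /* bw(s) = bw_inf / (1 + n_half/s): bandwidth rises with message
       size s and levels off at the asymptote bw_inf. */
    static double bw_model(double s_bytes, double bw_inf_MBs) {
        const double n_half = 8192.0;  /* bytes to reach bw_inf/2 (assumed) */
        return bw_inf_MBs / (1.0 + n_half / s_bytes);
    }

    int main(void) {
        const double one_adapter = 93.0;  /* MB/s; 85 for 2, 75 for 4 adapters */
        for (double s = 1e3; s <= 1e6; s *= 10.0)
            printf("%9.0f B: %5.1f MB/s\n", s, bw_model(s, one_adapter));
        return 0;
    }
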
CCM3 Study - Bandwidth Effect

[Bar chart: CCM3 large case on Origin3000 @600MHz, 64 CPUs; elapsed time per MPI rank, broken into computation, load imbalance, MPI latency, and physical data transfer. Left: current performance model (MPI latency 5.5 µs, bandwidth 275 MB/s). Right: the same run modeled with 4x the bandwidth (1.1 GB/s); imbalance ratio 1.4, CPU ratio 1]

CCM3 Study

[Bar chart: CCM3 large case; elapsed time per MPI rank, broken into computation, wait, MPI SW latency, and physical data transfer, for a 64p Origin3000 single system image, a 4x16p Origin300 cluster with 1 Myrinet board, and a 4x16p Origin300 cluster with 4 Myrinet boards]

CCM3 performance depends strongly on the number of communication channels.

CASTEP - Latency Effect

[Bar chart: CASTEP 24p execution modeled for different latencies — Origin3000 SSI, GSN latency, and HIPPI-800 latency; elapsed time (0-100 s) per processor 0-23, broken into computation, load imbalance, MPI latency, and physical data transfer]

Communication vs. Computation
Sweep3D on Origin3000 @600MHz

[Bar chart: elapsed time (0-400 s) for 16, 32, 64, 96, and 128 MPI tasks, broken into computation, wait, MPI latency, and data transfer]

BandeLa Pam-Crash Study: Scalability on Altix 3000
– Tests on a 64 CPU Altix 3000, Itanium2 @900MHz, 1.5MB L3 cache, single system image (SSI)
– Pam-Crash V2003 DMP-SP
– Using a BMW model with 284,000 elements
– Run for 5,000 time steps
– A special library is used to time or model the communication

Pam-Crash V2003 on Altix 3000
Automotive model, 5,000 time steps
– Computation scales up to 64p
– Communication overhead is too high at 64p

[Chart: PAM-Crash on Altix @900MHz, BMW6 model with 284,000 elements; speedup vs. number of CPUs (0-70) for computation speedup (rank 0), global speedup, and perfect scaling]

Pam-Crash V2003 on Altix 3000
BMW6 model, 5,000 time steps

[Bar chart: Pam-Crash BMW6 on Altix @900MHz with BandeLa; elapsed time per MPI rank for a 16 CPU Altix (model), a "perfect" MPI machine (model), and a 16 CPU Altix (measured), broken into computation, wait, MPI SW latency, and physical data transfer]

The BandeLa model estimates that a "perfect" MPI machine (zero latency, infinite bandwidth) would not help: the Altix 3000 run is already close to a "perfect" MPI model.

BandeLa - Vampir Compatibility
BandeLa can generate a trace compatible with the Vampir browser. Using Vampir, you can zoom in on the latency and bandwidth changes at any level of detail.

Requirements for Petaflops Applications
• Memory and cache footprint
  – amount of memory required at each level of the memory hierarchy
• Degree of data reuse
  – associated with the core kernels of the applications, the scaling of these kernels, and the associated estimate of memory bandwidth required at each level of the memory hierarchy
• Instruction mix (FP, integer, load/store)
• I/O requirements and storage for temporary results and checkpoints
• Amount of concurrency available in the applications, and communication requirements
  – bisection bandwidth, latency, fast synchronization patterns
• Communication/computation ratio and degree of overlap

[Diagram: memory hierarchy — processor; caches L1/L2/L3 (3-50 cycles); main memory (57-72 cycles); disk (~1.5 million cycles)]

Big Datasets: Generic
Tera-scale example (1000^4: x, y, z, and t each of extent 1000) — only three attributes, yet 24 TB of data:

    program main
    real*8 pressure(1000,1000,1000,1000)
    real*8 volume(1000,1000,1000,1000)
    real*8 temperature(1000,1000,1000,1000)
    do k = 1,1000
      do j = 1,1000
        do i = 1,1000
          do time_step = 1,1000
            pressure(i,j,k,time_step) = 0.
            volume(i,j,k,time_step) = 0.
            temperature(i,j,k,time_step) = 0.
          end do
        end do
      end do
    end do
    print *, pressure(1,1,1,1), volume(1,1,1,1), temperature(1,1,1,1)
    stop
    end

    C $f77 -64 main.f (to compile)
    C $limit stacksize 24000000m (before run), 24 TB
    C Only 3 attributes

Supercluster Concept: SGI® Altix™ 3000
• First industry-standard Linux® cluster with global shared memory
• NUMA support for large nodes: a single node scales to 64 CPUs and 512 GB of memory
• Global shared memory: clusters scale to 2,048 CPUs and 16 TB of memory
• All nodes can access one large memory space efficiently, so complex communication and data passing between nodes isn't needed; big data sets fit entirely in memory, and less disk I/O is needed

[Diagram: conventional clusters — nodes, each with its own OS and memory, on a commodity interconnect — vs. the SGI® Altix™ 3000 supercluster — nodes, each with its own OS, on the NUMAflex™ interconnect sharing one global memory]

Parallel Programming Models

[Diagram: Altix 3000 partitions. Intra-partition: OpenMP, Pthreads, MPI, SHMEM. Inter-partition: MPI, SHMEM, XPMEM]
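As a small illustration of that two-level split (added here, not from the talk), a hybrid code uses OpenMP threads inside a partition and MPI across partitions:

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    /* Minimal hybrid sketch: OpenMP reduction intra-partition,
       MPI reduction inter-partition. */
    int main(int argc, char **argv) {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double partial = 0.0;
        #pragma omp parallel reduction(+:partial)  /* intra-partition */
        partial += omp_get_thread_num() + 1;       /* stand-in for real work */

        double total = 0.0;                        /* inter-partition */
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("global sum = %g\n", total);
        MPI_Finalize();
        return 0;
    }
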
MPI in Clusters and Global Addressable Memory
MPI-1 two-sided send/receive latencies (short 8-byte messages):
• Gigabit Ethernet (TCP/IP): 100 µs (low-cost clusters)
• Myrinet: 13 µs (mid-range clusters)
• Quadrics: 4-5 µs (high-end clusters)
• Origin3000 @600MHz, MPT 1.5.2: 3.5 µs [SMP]
• Altix 3000 @1GHz: 1.5 µs [Supercluster]
• Goal for 2006: 0.9 µs
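Latencies like these are typically measured with a short-message ping-pong; here is a minimal, illustrative version (not SGI's benchmark code), to be run with two MPI ranks:

    #include <stdio.h>
    #include <mpi.h>

    /* Ranks 0 and 1 bounce one double back and forth; half the
       average round-trip time is the one-way latency. */
    int main(int argc, char **argv) {
        int rank; double buf = 0.0;
        const int iters = 10000;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("one-way latency: %.2f us\n",
                   (MPI_Wtime() - t0) / (2.0 * iters) * 1e6);
        MPI_Finalize();
        return 0;
    }
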
SGI Altix 3000 Scalability for Compute Intensive Applications

[Chart, higher is better: speedup (0-64) vs. number of processors (1-64) for Gaussian (CCM), Amber (CCM), Fasta (BIO), Star-CD (CFD), Vectis (CFD), LS-Dyna (CSM), TAU (CFD), HTC-Blast (BIO), Fastx (BIO), MM5 (CWO), CASTEP (CCM), GAMESS (CCM), NAMD (CCM), NWChem (CCM), and VASP (CCM), plus the ideal curve]

Scalability on Altix 3000 is in general similar to Origin3000.

Platform Directions: Mainframe Era

[Chart: total cost of computing (low to high) vs. cost of communication bandwidth (high to low) over time. Around 1985, centralized computing is more cost effective: the corporate resource is expensive and needs to be shared]

Total cost of computing = cost of HW + SW + related support costs.

Platform Directions: Decentralized Computing

[Chart, continued: by 1992, decentralized client-server computing is more cost effective — the corporate resource is cheap enough that "I don't have to share"]

Total cost of computing = cost of HW + SW + related support costs.

Platform Directions: Server Consolidation

[Chart, continued: by 2000, the corporate resource is cheap — "I can have as much as I want". Inset: nodes vs. processors per node, with SMPs scaling up, NOWs scaling out, and clusters of SMPs doing both]

Total cost of computing = cost of HW + SW + related support costs.

Platform Directions: Grid Computing

[Chart, continued: by 2007, the corporate resource is cheap and "I can have as much as I want, but I don't have to own it". Inset: nodes vs. processors per node, with SMPs scaling up, NOWs scaling out, and the super-cluster of SMPs doing both]

Total cost of computing = cost of HW + SW + related support costs.

Conclusions
• Compute-intensive applications are also data-intensive.
• Standard benchmarks define a performance corridor for applications.
• Communication vs. computation profiling of compute-intensive applications is essential in designing scalable parallel computer systems.
• Load imbalance is the most influential factor on scalability.
• Preserving globally addressable memory beyond the boundary of a single node in a cluster improves not only the communication efficiency but also the load balancing.
• The Altix 3000 super-cluster is a very efficient MPI machine.