TRANSCRIPT
Scalability Considerations for Compute Intensive Applications on Clusters
Christian Tanasescu, Daniel Thomas
SGI Inc.

Agenda
– Application Segments
– HPC Computational Requirements
– Scalability and Application Profiles
– Standard Benchmarks vs. Applications
– Communication vs. Computation Ratio
– BandeLa - Profiling and Modeling Tool
– Platform Directions
– Conclusions

The Move to Technology Transparency
The application drives the platform.

[Chart: "sellable feature" (low to high) vs. time — 1985–1995: feeds and speeds; 1995–1999: feeds and speeds plus architecture; 2000–2005: feeds and speeds, architecture, plus application]

Application Segments
– CSM - Computational Structural Mechanics
– CFD - Computational Fluid Dynamics
– CCM - Computational Chemistry and Material Science
– BIO - Bioinformatics
– SPI - Seismic Processing and Interpretation
– RES - Reservoir Simulation
– CWO - Climate/Weather/Ocean Simulation

Sweet Spot Scalability per Application Segment

[Bar chart, log scale 1–512 processors: sweet-spot scalability ranges for the CSM, CFD, CCM, BIO, SPI, RES, and CWO segments; the ranges shown are 32–512p, 8–64p, 4–32p, 8–64p, 16–64p, 4–16p, and 1–32p]

HPC Resource Demands for Energy

Energy Segment       | Software                | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
Seismic Processing¹  | ProMAX, Omega, GeoDepth | H   | H         | H      | L–M²     | H       | ~4 to 500
Reservoir Simulation | VIP, Eclipse            | M   | H         | L      | H        | M       | ~100

¹ Seismic processing packages such as ProMAX are comprised of a large number of executables. The data in this row are for the subsets of executables that are most time-consuming.
² There are modules for which this entry would be H, but they comprise only about 10% of the total seismic processing workload.

HPC Resource Demands for CAE

MCAE Segment     | Software                    | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
IFEA Statics     | ABAQUS, ANSYS, MSC.Nastran  | H   | H         | M      | L        | L       | < 10p
IFEA Dynamics    | ABAQUS, ANSYS, MSC.Nastran  | L   | H         | H      | H        | L       | < 10p
EFEA             | LS-DYNA, PAM-CRASH, RADIOSS | H   | L         | L      | M        | M       | ~50p
CFD Unstructured | FLUENT, STAR-CD, PowerFLOW  | M   | H         | M      | H        | H       | ~100p
CFD Structured   | OVERFLOW                    | H   | H         | L      | M        | M       | ~100p

HPC Resource Demands for Bioinformatics

Bio Segment             | Software                             | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
Sequence Matching       | Blast, Fasta, Smith-W., HMMER, Wise2 | H   | M         | L      | L        | M       | 4-32
Sequence Matching (HTC) | HTC + sequence-matching code         | H   | H         | M      | L        | M       | ~100
Sequence Alignment      | ClustalW, Phylip                     | H   | M         | L      | M        | L       | 24
Sequence Assembly       | Phrap, Paracel                       | H   | M         | M      | M        | L       | 16

HPC Resource Demands for Computational Chemistry

Segment           | Software                      | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
QM "ab initio"    | Gaussian, Gamess, ADF, CASTEP | H   | H/M       | H/M    | L        | L       | 1-32
QM semi-empirical | Mopac, Ampac                  | H   | L         | L      | L        | M       | 1-4
MM/MD             | Amber, Charmm, NAMD           | H   | M         | M      | M        | H       | 1-64
Docking           | Dock, FleXx                   | H   | L         | L      | L        | L       | 1-64

QM: Quantum Mechanics; large variation in Memory BW, I/O BW, and scalability. MM/MD: Molecular Mechanics/Molecular Dynamics. Docking: scalability via throughput.

HPC Resource Demands for Weather and Climate Models

Segment                         | Software            | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
Explicit finite difference      | MM5                 | H   | M         | L      | L        | H       | ~100-500p
Semi-implicit finite difference | HIRLAM              | H   | M         | L      | H        | H       | ~100-500p
Spectral climate models         | CCM3/CAM            | H   | M         | L      | H        | M       | ~64-128p
Spectral weather models         | NOGAPS, IFS, ALADIN | H   | M         | L      | H        | M       | ~200p
Coupled climate models          | CCSM2, FMS          | H   | M         | L      | H        | H       | ~100p

Performance Dependency on Architecture
R12K @400MHz, 8MB L2, in Origin2000 and Origin3000

[Bar chart: Origin3000 @400MHz performance relative to Origin2000 @400MHz for Linpack, Specfp2000, STREAM, Abaqus/std, Nastran (103), Nastran (101), Star-CD, LS-Dyna, PAM-Crash, Radioss, Vectis, Fluent, CASTEP, Amber, and Gaussian, grouped by dominant bottleneck (CPU; CPU/cache; memory); ratios range from about 1.03 to 1.64]

The performance corridor is defined by Linpack (lower limit) and STREAM (upper limit).
• The relative performance improvement in applications is greater than the factor indicated by Specfp2000.
• Exceptions are the I/O-intensive applications such as Nastran-NVH or Gaussian (bandwidth steers performance).

Performance Dependency on Microprocessor Clock Rate - Same Architecture

[Bar chart: Origin3000 @600MHz performance relative to Origin3000 @500MHz for Linpack, Specfp2000, STREAM, Abaqus/std, Abaqus/Exp, Nastran (103, 111, 101, 108), Ansys, Star-CD (1p, 8p), LS-Dyna, PAM-Crash, Radioss, Madymo, Vectis (8p), Fluent (8p, 1p), Fire, and PowerFLOW (16p), grouped by dominant bottleneck (CPU; CPU/cache; memory); ratios range from about 1.04 to 1.2]

The performance corridor is defined by Linpack (upper limit) and STREAM (lower limit).

Performance Dependency on Microprocessor Cache Size - Same Architecture

[Bar chart: Origin300 @500MHz performance relative to Origin3000 @500MHz for the same benchmark and application set as above; ratios range from about 0.84 to 1.0]

The performance corridor is defined by Specfp2000 (lower limit) and STREAM (upper limit).

Key Applications: Instruction Mix

[Stacked bar chart, 0-100%: mix of floating-point operations, integer operations, memory access instructions, and branch instructions for Nastran, Ansys, PAM-Crash, LS-Dyna, Radioss (CSM); PowerFLOW, Fluent, Star-HPC, Fire (CFD); Gaussian, Gamess, Amber, CASTEP, ADF (CCM); BLAST, FASTA, ClustalW (BIO); MM5, HIRLAM, CCM3, IFS (CWO); ProMAX, Omega (SPI); Eclipse, VIP (RES)]

Instruction Mix
• Real applications have between 5% and 45% FP instructions, with an average of 22%, while the average share of memory access instructions is 39%.
• There are more integer than FP instructions; exceptions are BLAS-like solvers such as Nastran, Abaqus, and ProMAX.
• The ratio of graduated loads and stores to FP operations is 1.7x (worked through in the sketch below).
• Compute-intensive applications are also data-intensive applications.
• Vector systems had a system balance of 1 (one flop per byte).
• Next-generation architectures need to address the memory bandwidth issue.
  – I/O puts an additional burden on memory bandwidth.
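To make the loads/stores-to-flops arithmetic concrete, here is a minimal sketch (added for illustration, not from the talk) that computes the bytes-per-flop demand of a DAXPY-like kernel, the kind of BLAS-style operation these applications spend much of their time in:

    #include <stdio.h>

    /* DAXPY: y[i] = a*x[i] + y[i]
       Per iteration: 2 flops (one multiply, one add) and
       3 double accesses (load x, load y, store y) = 24 bytes. */
    int main(void) {
        double flops_per_iter = 2.0;
        double bytes_per_iter = 3.0 * sizeof(double);   /* 24 bytes */
        printf("DAXPY demand: %.0f bytes/flop\n",
               bytes_per_iter / flops_per_iter);        /* prints 12 */
        return 0;
    }

Compare that 12 bytes/flop demand with the vector-era balance of one flop per byte; cache-based systems supply far less memory bandwidth per flop, which is why data reuse in cache is decisive.
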
System Balance
• Supercomputing platforms must balance:
  – microprocessor power
  – memory size
  – bandwidth
  – latency
  – I/O balance is another important consideration
• Supercomputers after the Cray 1 began to lose balance.

[Chart, lower is better: balance vs. number of CPUs (1-64) for CRAY T90, NEC SX-5, IBM p690 @1.3GHz, SGI O3K @600MHz, HP Superdome @750MHz, HP GS320 @1GHz, and SGI Altix 3000 @1GHz]

Communication vs. Computation Ratio in Key Applications - Measured with BandeLa

[Stacked bar chart, 0-100%: computation, wait, MPI SW latency, and data transfer shares for Nastran/4, Ansys/2, PAM-Crash/32, LS-Dyna/48p, Radioss/96 (CSM); PowerFLOW/64, Fluent/64, Star-HPC, Fire/32 (CFD); Gaussian/16, Gamess/32, Amber/8, CASTEP/128, ADF/32 (CCM); BLAST/16, FASTA/16, ClustalW/16 (BIO); MM5, HIRLAM, CCM3/16, IFS (CWO); ProMAX, Omega (SPI); Eclipse/52, VIP/32 (RES)]

Communication Details
• Computation: the time spent outside MPI.
• Wait: the time a CPU is locked in mpi_wait, caused by:
  – load imbalance
  – contention of the traffic through the interconnect fabric or the switch
• MPI SW latency: the time accounted to the MPI library; sensitive to MPI latency.
• Data transfer: the time the transfer engine is active (bcopy on Origin3000 or Altix 3000); sensitive to MPI bandwidth.

An important inhibiting factor for scalability is load imbalance (WAIT). It needs to be addressed by future architectures and programming models; a simple way to quantify it is sketched below.
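As a hypothetical illustration (not BandeLa's own metric), a scalar summary of this WAIT-driven imbalance is the ratio of the slowest rank's compute time to the mean:

    #include <stdio.h>

    /* max/mean of per-rank compute times: 1.0 means perfectly balanced. */
    double imbalance(const double *t, int nranks) {
        double max = t[0], sum = 0.0;
        for (int i = 0; i < nranks; i++) {
            if (t[i] > max) max = t[i];
            sum += t[i];
        }
        return max / (sum / nranks);
    }

    int main(void) {
        double t[4] = { 9.0, 10.0, 11.0, 14.0 };   /* seconds per rank */
        printf("imbalance = %.2f\n", imbalance(t, 4));   /* 1.27 */
        return 0;
    }

The CCM3 study later in the talk quotes an imbalance ratio of 1.4 in exactly this spirit.
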
BandeLa Profiling Tool
An MPI tool to answer the question: what if the BANDwidth and LAtency change up or down?
1) Run the application with the targeted number of CPUs in order to capture the timings outside the MPI calls and the sequence of MPI "kernels" generated by the MPI library (isend, irecv, wait, test).
2) Replay the timings, applying a simple model to time the above kernels (see the replay sketch below).
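A minimal sketch of the replay step, assuming a trace of alternating compute and communication events; the event layout and the linear latency-plus-size/bandwidth model are illustrative simplifications, not BandeLa's actual implementation:

    #include <stdio.h>

    enum kind { COMPUTE, COMM };
    struct event { enum kind k; double compute_secs; double msg_bytes; };

    /* Advance one rank's clock through its recorded trace, re-timing
       each communication kernel under candidate (latency, bandwidth). */
    static double replay(const struct event *ev, int n,
                         double latency, double bandwidth) {
        double clock = 0.0;
        for (int i = 0; i < n; i++)
            clock += (ev[i].k == COMPUTE)
                   ? ev[i].compute_secs                      /* replayed as-is */
                   : latency + ev[i].msg_bytes / bandwidth;  /* re-modeled */
        return clock;
    }

    int main(void) {
        struct event trace[] = { { COMPUTE, 1.0, 0 }, { COMM, 0, 1e6 },
                                 { COMPUTE, 2.0, 0 }, { COMM, 0, 8e6 } };
        printf("predicted elapsed: %.3f s\n",
               replay(trace, 4, 2.25e-6, 250e6));  /* Origin3000-like values */
        return 0;
    }
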
BandeLa Profiling Tool
– Several topologies can be specified:
  • single host, clusters
– Several communication schemes:
  • Origin3000-like (the receiving CPU does the transfer)
  • synchronous and asynchronous transfers
  • interleaving (an arriving message shares the hardware immediately)
  • no interleaving (an arriving message waits for the previous messages to fully complete)

BandeLa - Basic Functionality
– The MPI library transforms any MPI function into a sequence of four kernels:
  • MPI_SGI_request_send ≡ mpi_isend
  • MPI_SGI_request_recv ≡ mpi_irecv
  • MPI_SGI_request_test ≡ mpi_test
  • MPI_SGI_request_wait ≡ mpi_wait
– The BandeLa instrumentation catches these sequences and records the computational time outside MPI. This is an application "signature" independent of the communication hardware (the general interception technique is sketched below).
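For readers unfamiliar with how such instrumentation hooks in, here is a sketch of the standard MPI profiling-interface technique: define the MPI entry point, record an event, and forward to the PMPI_ name. It is illustrative only; BandeLa intercepts SGI's internal MPI_SGI_request_* kernels rather than this portable layer.

    #include <stdio.h>
    #include <mpi.h>

    /* The wrapper library is loaded ahead of the MPI library; every
       MPI_Isend is logged (bytes, time spent inside MPI) and forwarded. */
    int MPI_Isend(const void *buf, int count, MPI_Datatype type, int dest,
                  int tag, MPI_Comm comm, MPI_Request *req) {
        double t0 = MPI_Wtime();
        int rc = PMPI_Isend(buf, count, type, dest, tag, comm, req);
        int size;
        PMPI_Type_size(type, &size);
        fprintf(stderr, "isend: %d bytes, %.2f us inside MPI\n",
                count * size, (MPI_Wtime() - t0) * 1e6);
        return rc;
    }

The gap between one kernel's exit timestamp and the next kernel's entry is the computation time recorded in the application signature.
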
BandeLa - Instrumentation Example
– No need to relink. Some environment variables can also be set in order to partially instrument the application without relinking:

    setenv LD_LIBRARY64_PATH .../ACQUIRE_64
    setenv RLD64_LIST libbandela.so:DEFAULT
    f77 -o test_bcast test_bcast.f -lmpi
    setenv MPI_BUFFER_MAX 2000
    mpirun -np 4 test_bcast

– One file is created for each process; four files are created for this example: fort.177, fort.178, fort.179, fort.180 (the starting number, 177, may be changed with an environment variable).

BandeLa - Parameters (Single Host)
• MPI latency:
  – the time accounted to the MPI software for doing its work (queuing messages, checking message arrivals, ...)
  – for the model, this is simply an amount of time added to the communication table of the particular CPU on entry of an "MPI kernel function"
  – 2.25 µs on Origin3000, or 4.5 µs for a full send-receive on Origin3000
• MPI bandwidth:
  – the speed at which bcopy does its job
  – 250 MB/s on average on the Origin3000
BandeLa - Validating the Model with Measurement (Origin3000 Single Host)
Model using the default bandwidth of 250 MB/s.

[Bar chart: CCM3 spectral climate model on 16 CPU Origin3000; elapsed time (0-600 s) per MPI rank 0-15, broken into computation, wait, MPI SW latency, and physical data transfer, compared against the measured communication time]

BandeLa - Validating the Model with Measurement (Origin3000 Single Host)
Model using a tuned bandwidth of 225 MB/s.

[Bar chart: CCM3 spectral climate model on 16 CPU Origin3000; elapsed time (0-600 s) per MPI rank 0-15, broken into computation, wait, MPI SW latency, and physical data transfer, compared against the measured communication time]

BandeLa - "What-If" Analysis
• Topologies available:
  – single shared-memory host (Origin3000 or Altix 3000)
  – switch (clusters)

BandeLa - Data Transfer Methods
• Transfer synchronously done by the receiving CPU (Origin3000 or Altix 3000 host)
• Transfer done synchronously or asynchronously, with or without interleaving, at constant bandwidth
• Transfer done asynchronously with interleaving, with bandwidth depending on request size (Myrinet)

BandeLa Myrinet Parameters?
• MPI SW latency: the Myrinet 2000 ping-pong latency on Origin300 has been measured at 17 µs.
• Bandwidth: as for the Origin300 single host, the workload may change the adapter performance, but:
  – the bandwidth also depends on the message size
  – the CPUs have to share the adapter(s) (this is considered in the model used in BandeLa)
  – the number of adapters used changes the performance of each adapter

BandeLa Myrinet Parameters?
The bandwidth chart given by Myricom vs. the bandwidth chart used by the model: the model depends only on the asymptotic bandwidth, which is not 250 MB/s under a real workload.

[Chart: modeled Myrinet bandwidth (0-300 MB/s) vs. message size (1 byte to 1,000,000 bytes)]

BandeLa Myrinet Parameters?
We use the following asymptotic values:
  – 1 adapter: 93 MB/s
  – 2 adapters: 85 MB/s
  – 4 adapters: 75 MB/s
These values were set from runs with two Origin300 systems linked with four adapters on both machines. We think these are the asymptotic bandwidths really seen by the applications.
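One simple way to realize a size-dependent bandwidth curve that approaches such asymptotes is the classic half-performance-length form sketched below. Only the 93/85/75 MB/s asymptotes come from the slide; the n_half value is an assumed illustrative figure, not the curve BandeLa actually uses.

    #include <stdio.h>

    /* bw(s) = bw_inf / (1 + n_half/s): bandwidth rises with message
       size s and levels off at the asymptote bw_inf. */
    static double bw_model(double s_bytes, double bw_inf_MBs) {
        const double n_half = 8192.0;  /* bytes to reach bw_inf/2 (assumed) */
        return bw_inf_MBs / (1.0 + n_half / s_bytes);
    }

    int main(void) {
        const double one_adapter = 93.0;  /* MB/s; 85 for 2, 75 for 4 adapters */
        for (double s = 1e3; s <= 1e6; s *= 10.0)
            printf("%9.0f B: %5.1f MB/s\n", s, bw_model(s, one_adapter));
        return 0;
    }
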
CCM3 Study - Bandwidth Effect

[Bar chart: CCM3 large case on Origin3000 @600MHz, 64 CPUs; elapsed time per MPI rank, broken into computation, load imbalance, MPI latency, and physical data transfer. Left: current performance model (MPI latency 5.5 µs, bandwidth 275 MB/s). Right: the same run modeled with 4x the bandwidth (1.1 GB/s); imbalance ratio 1.4, CPU ratio 1]

CCM3 Study

[Bar chart: CCM3 large case; elapsed time per MPI rank, broken into computation, wait, MPI SW latency, and physical data transfer, for a 64p Origin3000 single system image, a 4x16p Origin300 cluster with 1 Myrinet board, and a 4x16p Origin300 cluster with 4 Myrinet boards]

CCM3 performance depends strongly on the number of communication channels.

CASTEP - Latency Effect

[Bar chart: CASTEP 24p execution modeled for different latencies — Origin3000 SSI, GSN latency, and HIPPI-800 latency; elapsed time (0-100 s) per processor 0-23, broken into computation, load imbalance, MPI latency, and physical data transfer]

Communication vs. Computation
Sweep3D on Origin3000 @600MHz

[Bar chart: elapsed time (0-400 s) for 16, 32, 64, 96, and 128 MPI tasks, broken into computation, wait, MPI latency, and data transfer]

BandeLa Pam-Crash Study: Scalability on Altix 3000
– Tests on a 64 CPU Altix 3000, Itanium2 @900MHz, 1.5MB L3 cache, single system image (SSI)
– Pam-Crash V2003 DMP-SP
– Using a BMW model with 284,000 elements
– Run for 5,000 time steps
– A special library is used to time or model the communication

Pam-Crash V2003 on Altix 3000
Automotive model, 5,000 time steps
– Computation scales up to 64p
– Communication overhead is too high at 64p

[Chart: PAM-Crash on Altix @900MHz, BMW6 model with 284,000 elements; speedup vs. number of CPUs (0-70) for computation speedup (rank 0), global speedup, and perfect scaling]

Pam-Crash V2003 on Altix 3000
BMW6 model, 5,000 time steps

[Bar chart: Pam-Crash BMW6 on Altix @900MHz with BandeLa; elapsed time per MPI rank for a 16 CPU Altix (model), a "perfect" MPI machine (model), and a 16 CPU Altix (measured), broken into computation, wait, MPI SW latency, and physical data transfer]

The BandeLa model estimates that a "perfect" MPI machine (zero latency, infinite bandwidth) would not help: the Altix 3000 run is already close to a "perfect" MPI model.

BandeLa - Vampir Compatibility
BandeLa can generate a trace compatible with the Vampir browser. Using Vampir, you can zoom in on the latency and bandwidth changes at any level of detail.

Requirements for Petaflops Applications
• Memory and cache footprint
  – amount of memory required at each level of the memory hierarchy
• Degree of data reuse
  – associated with the core kernels of the applications, the scaling of these kernels, and the associated estimate of memory bandwidth required at each level of the memory hierarchy
• Instruction mix (FP, integer, load/store)
• I/O requirements and storage for temporary results and checkpoints
• Amount of concurrency available in the applications, and communication requirements
  – bisection bandwidth, latency, fast synchronization patterns
• Communication/computation ratio and degree of overlap

[Diagram: memory hierarchy — processor; caches L1/L2/L3 (3-50 cycles); main memory (57-72 cycles); disk (~1.5 million cycles)]

Big Datasets: Generic
Tera-scale example (1000^4: x, y, z, and t each of extent 1000) — only three attributes, yet 24 TB of data:

    program main
    real*8 pressure(1000,1000,1000,1000)
    real*8 volume(1000,1000,1000,1000)
    real*8 temperature(1000,1000,1000,1000)
    do k = 1,1000
      do j = 1,1000
        do i = 1,1000
          do time_step = 1,1000
            pressure(i,j,k,time_step) = 0.
            volume(i,j,k,time_step) = 0.
            temperature(i,j,k,time_step) = 0.
          end do
        end do
      end do
    end do
    print *, pressure(1,1,1,1), volume(1,1,1,1), temperature(1,1,1,1)
    stop
    end

    C $f77 -64 main.f (to compile)
    C $limit stacksize 24000000m (before run), 24 TB
    C Only 3 attributes

Supercluster Concept: SGI® Altix™ 3000
• First industry-standard Linux® cluster with global shared memory
• NUMA support for large nodes: a single node scales to 64 CPUs and 512 GB of memory
• Global shared memory: clusters scale to 2,048 CPUs and 16 TB of memory
• All nodes can access one large memory space efficiently, so complex communication and data passing between nodes isn't needed; big data sets fit entirely in memory, and less disk I/O is needed

[Diagram: conventional clusters — nodes, each with its own OS and memory, on a commodity interconnect — vs. the SGI® Altix™ 3000 supercluster — nodes, each with its own OS, on the NUMAflex™ interconnect sharing one global memory]

Parallel Programming Models

[Diagram: Altix 3000 partitions. Intra-partition: OpenMP, Pthreads, MPI, SHMEM. Inter-partition: MPI, SHMEM, XPMEM]
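As a small illustration of that two-level split (added here, not from the talk), a hybrid code uses OpenMP threads inside a partition and MPI across partitions:

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    /* Minimal hybrid sketch: OpenMP reduction intra-partition,
       MPI reduction inter-partition. */
    int main(int argc, char **argv) {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double partial = 0.0;
        #pragma omp parallel reduction(+:partial)  /* intra-partition */
        partial += omp_get_thread_num() + 1;       /* stand-in for real work */

        double total = 0.0;                        /* inter-partition */
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("global sum = %g\n", total);
        MPI_Finalize();
        return 0;
    }
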
MPI in Clusters and Global Addressable Memory
MPI-1 two-sided send/receive latencies (short 8-byte messages):
• Gigabit Ethernet (TCP/IP): 100 µs (low-cost clusters)
• Myrinet: 13 µs (mid-range clusters)
• Quadrics: 4-5 µs (high-end clusters)
• Origin3000 @600MHz, MPT 1.5.2: 3.5 µs [SMP]
• Altix 3000 @1GHz: 1.5 µs [Supercluster]
• Goal for 2006: 0.9 µs
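Latencies like these are typically measured with a short-message ping-pong; here is a minimal, illustrative version (not SGI's benchmark code), to be run with two MPI ranks:

    #include <stdio.h>
    #include <mpi.h>

    /* Ranks 0 and 1 bounce one double back and forth; half the
       average round-trip time is the one-way latency. */
    int main(int argc, char **argv) {
        int rank; double buf = 0.0;
        const int iters = 10000;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("one-way latency: %.2f us\n",
                   (MPI_Wtime() - t0) / (2.0 * iters) * 1e6);
        MPI_Finalize();
        return 0;
    }
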
SGI Altix 3000 Scalability for Compute Intensive Applications

[Chart, higher is better: speedup (0-64) vs. number of processors (1-64) for Gaussian (CCM), Amber (CCM), Fasta (BIO), Star-CD (CFD), Vectis (CFD), LS-Dyna (CSM), TAU (CFD), HTC-Blast (BIO), Fastx (BIO), MM5 (CWO), CASTEP (CCM), GAMESS (CCM), NAMD (CCM), NWChem (CCM), and VASP (CCM), plus the ideal curve]

Scalability on Altix 3000 is in general similar to Origin3000.

Platform Directions: Mainframe Era

[Chart: total cost of computing (low to high) vs. cost of communication bandwidth (high to low) over time. Around 1985, centralized computing is more cost effective: the corporate resource is expensive and needs to be shared]

Total cost of computing = cost of HW + SW + related support costs.

Platform Directions: Decentralized Computing

[Chart, continued: by 1992, decentralized client-server computing is more cost effective — the corporate resource is cheap enough that "I don't have to share"]

Total cost of computing = cost of HW + SW + related support costs.

Platform Directions: Server Consolidation

[Chart, continued: by 2000, the corporate resource is cheap — "I can have as much as I want". Inset: nodes vs. processors per node, with SMPs scaling up, NOWs scaling out, and clusters of SMPs doing both]

Total cost of computing = cost of HW + SW + related support costs.

Platform Directions: Grid Computing

[Chart, continued: by 2007, the corporate resource is cheap and "I can have as much as I want, but I don't have to own it". Inset: nodes vs. processors per node, with SMPs scaling up, NOWs scaling out, and the super-cluster of SMPs doing both]

Total cost of computing = cost of HW + SW + related support costs.

Conclusions
• Compute-intensive applications are also data-intensive.
• Standard benchmarks define a performance corridor for applications.
• Communication vs. computation profiling of compute-intensive applications is essential in designing scalable parallel computer systems.
• Load imbalance is the most influential factor on scalability.
• Preserving globally addressable memory beyond the boundary of a single node in a cluster improves not only the communication efficiency but also the load balancing.
• The Altix 3000 super-cluster is a very efficient MPI machine.