Performance analysis of GA applications 19th June 2001
Computational Science and Engineering Department Daresbury Laboratory
Performance analysis of GA-based applications using the Vampir tool
NWChem and GAMESS-UK on High-end and Commodity class machines.
H.J.J. van Dam, Martyn Guest and Paul Sherwood, Quantum Chemistry Group, CLRC Daresbury Laboratory
http://www.cse.clrc.ac.uk
Miles Deegan, Compaq (Galway)

Outline
• Background: PNNL, Daresbury and PALLAS
• Tool for Performance Analysis - VAMPIR & VAMPIR Trace
  – VAMPIR - analysis of trace files
  – VAMPIR Trace
    – Trace Library for MPI applications
    – Extensions to handle GA applications
• Case Studies
  – DFT Calculations on Zeolite Fragments (347 - 1687 GTOs) with Coulomb Fitting
  – High-end Systems - Cray T3E/1200E, Compaq AlphaServer SC (667 & 833 MHz), SGI Origin 3000/R12k-400 and IBM SP/WH2-375
  – Commodity Clusters (IA32 and Alpha Linux)
• NWChem and GAMESS-UK
  – Distributed data (NWChem) and Replicated Data (GAMESS-UK)
  – Analysis of GAs and PeIGs
• Summary

PNNL - Daresbury - Pallas Collaborations
• PNNL - Daresbury Collaboration
  – Long term interaction between chemistry activities
  – Proposed developments around DFT derivative codes
    – UK Chemistry Collaboration Forum (CCP1)
    – DFT Flagship project and subsequent DL extensions
    – DFT Functional Repository (http://www.dl.ac.uk/DFTlib)
• Daresbury - Pallas Collaboration
  – Demonstrate that clusters of IA32 and Alpha processors are competitive with HPC servers (with low to medium processor numbers) for a wide range of applications
  – Evaluate the suitability of clusters for high-end computing
  – Analyse kernels and full applications (May 2000 - Sep. 2001)

Vampir 2.5
Visualization and Analysis of MPI Programs

Vampir Features
• Offline trace analysis for MPI (and others ...)
• Traces generated by Vampirtrace tool (`ld ... -lVT -lpmpi -lmpi`)
• Convenient user-interface
• Scalability in time and processor-space
• Excellent zooming and filtering
• High-performance graphics
• Display and analysis of MPI and application events:
  – execution of MPI routines
  – point-to-point and collective communication
  – MPI-2 I/O operations
  – execution of application subroutines (optional)
• "Easy" customization

Vampir Displays
• Global displays show all selected processes
  – Summary Chart: aggregated profiling information
  – Activity Chart: presents per-process profiling information
  – Timeline: detailed application execution over time axis
  – Communication statistics: message statistics for each process pair
  – Global Comm. Statistics: collective operations statistics
  – I/O Statistics: MPI I/O operation statistics
  – Calling Tree: draws global or local dynamic calling trees
• Process displays show a single process per window
  – Activity Chart
  – Timeline
  – Calling Tree

Timeline Display (Message Info)
• Source-code references are displayed if recorded by Vampirtrace
• Click on message line
• See message details
• Message send op
• Message receive op

Vampirtrace
Tracing of MPI and Application Events

Vampirtrace
• Current version: Vampirtrace 2.0
• Significant new features:
  – records collective communication
  – enhanced filter functions
  – extended API
  – records source-code information (selected platforms)
  – support for shmem (Cray T3E)
  – records MPI-2 I/O operations
• Available for all major MPI platforms
• Library that records all MPI calls, point to point communication, and collective operations.
• Runtime filters available to limit tracefile size.
• Provides an API for user instrumentation.
• Requires MPI to gather performance data.
• Uses the profiling interface of MPI and is therefore independent of the specifics of a given MPI implementation.

Vampirtrace API
• Switching tracing on/off
  SUBROUTINE VTTRACEOFF( )
  SUBROUTINE VTTRACEON( )
• Specifying user-defined states
  SUBROUTINE VTSYMDEF(ICODE, STATE, ACTIVITY, IERR)
• Entering/leaving user-defined states
  SUBROUTINE VTBEGIN(ICODE, IERR)
  SUBROUTINE VTEND(ICODE, IERR)
• Logging message send/receive events (undocumented)
  SUBROUTINE VTLOGSENDMSG(IME, ITO, ICNT, ITAG, ICOMMID, IERR)
  SUBROUTINE VTLOGRECVMSG(IME, IFRM, ICNT, ITAG, ICOMMID, IERR)
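
The define/begin/end routines listed here bracket a user-defined state: a code is registered once, and every entry into the state must be closed by a matching exit. A minimal Python sketch of that bracketing discipline follows. It uses a hypothetical in-memory `MockTracer` class, not the Vampirtrace library; the method names only echo the Fortran routine names, the state names `fock_build` and `diag` are invented for illustration, and timestamps come from a counter so the result is deterministic.

```python
class MockTracer:
    """Hypothetical stand-in for a tracing library; NOT the Vampirtrace API."""

    def __init__(self):
        self.symbols = {}    # state code -> (state name, activity name)
        self.events = []     # recorded (tick, kind, code) tuples
        self.enabled = True  # toggled by traceon/traceoff
        self._stack = []     # codes of the states that are currently open
        self._tick = 0       # deterministic substitute for a clock

    def symdef(self, icode, state, activity):
        # Register a user-defined state under an integer code.
        self.symbols[icode] = (state, activity)

    def traceoff(self):
        self.enabled = False

    def traceon(self):
        self.enabled = True

    def _record(self, kind, icode):
        self._tick += 1
        if self.enabled:
            self.events.append((self._tick, kind, icode))

    def begin(self, icode):
        if icode not in self.symbols:
            raise KeyError(f"state code {icode} was never defined")
        self._stack.append(icode)
        self._record("begin", icode)

    def end(self, icode):
        # States must be left in the reverse order they were entered.
        if not self._stack or self._stack[-1] != icode:
            raise RuntimeError(f"end({icode}) does not match the innermost open state")
        self._stack.pop()
        self._record("end", icode)


tracer = MockTracer()
tracer.symdef(1, "fock_build", "SCF")  # illustrative names only
tracer.symdef(2, "diag", "SCF")

tracer.begin(1)   # enter outer state
tracer.begin(2)   # enter nested state
tracer.end(2)     # leave nested state first
tracer.end(1)     # then leave the outer state
```

Nested states show up in a timeline display as stacked intervals, which is why an unmatched end call is treated as an error in this sketch.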

Global Arrays
Single, shared data structure; physically distributed data
• Shared-memory-like model
  – Fast local access
  – NUMA aware and easy to use
  – MIMD and data-parallel modes
  – Inter-operates with MPI, …
• BLAS and linear algebra interface
• Ported to major parallel machines
  – IBM, Cray, SGI, clusters,...
• Originated in an HPCC project
• Used by 5 major chemistry codes, financial futures forecasting, astrophysics, computer graphics

Instrumenting single-sided memory access
• Approach 1: Instrument the puts, gets and data server
  – Advantage: robust and accurate
  – Disadvantage: one does not always have access to the source of the data server
• Approach 2: Instrument the puts and gets only, cheating on the source and destination of the messages
  – Advantage: no instrumentation of the data server required
  – Disadvantage: timings of the messages are inaccurate in case of non-blocking operations
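
Approach 2 can be pictured as follows: the calling process records both halves of each one-sided transfer itself, attributing one half to a remote process that never ran any tracing code. A minimal Python sketch under that reading, with a plain list standing in for the trace library's message log; the `MsgEvent` record and the `log_get`/`log_put` helpers are illustrative assumptions, not part of Vampirtrace or Global Arrays.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MsgEvent:
    kind: str     # "send" or "recv"
    proc: int     # process the event is attributed to
    peer: int     # process at the other end of the message
    nbytes: int   # payload size
    time: float   # timestamp, always taken on the calling process


def log_get(log, me, owner, nbytes, t_start, t_end):
    """Record a one-sided get: data flows from `owner` to `me`.

    The send half is attributed to `owner`, although only `me` runs this code.
    """
    log.append(MsgEvent("send", owner, me, nbytes, t_start))
    log.append(MsgEvent("recv", me, owner, nbytes, t_end))


def log_put(log, me, owner, nbytes, t_start, t_end):
    """Record a one-sided put: data flows from `me` to `owner`.

    The receive half is attributed to `owner`, again logged by the caller.
    """
    log.append(MsgEvent("send", me, owner, nbytes, t_start))
    log.append(MsgEvent("recv", owner, me, nbytes, t_end))


log = []
log_get(log, me=0, owner=3, nbytes=8000, t_start=1.0, t_end=1.5)
log_put(log, me=0, owner=2, nbytes=4000, t_start=2.0, t_end=2.25)
```

Both timestamps come from the caller's clock. For a non-blocking operation the call returns before the transfer has completed, so the recorded end time understates the real message duration, which is the inaccuracy noted above.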

Runtime tracing options
• The tracing of activities can be modified at runtime through a configuration file.
• Tracing of messages cannot be changed.
• VTTRACEON and VTTRACEOFF should be used sparingly.
Logfile-name /home/user/prog.bpv
Symbol nnodes off
Symbol nodeid off
Symbol GA_Nnodes off
Symbol GA_Nodeid off
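
The configuration lines above name a trace file and switch off tracing for individual symbols. A minimal Python sketch that reads directives of the two forms shown on the slide (`Logfile-name <path>` and `Symbol <name> on|off`) follows. It is an illustrative reader written for this note, not the Vampirtrace parser, and it assumes that `on` is the counterpart of `off`; the real file format may accept other directives.

```python
def parse_trace_config(text):
    """Return (logfile, symbols) from configuration text.

    `symbols` maps a symbol name to True (traced) or False (not traced).
    Only the two directive forms shown on the slide are understood.
    """
    logfile = None
    symbols = {}
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        key = fields[0].lower()
        if key == "logfile-name" and len(fields) == 2:
            logfile = fields[1]
        elif key == "symbol" and len(fields) == 3 and fields[2].lower() in ("on", "off"):
            symbols[fields[1]] = fields[2].lower() == "on"
        else:
            raise ValueError(f"unrecognised directive: {raw!r}")
    return logfile, symbols


CONFIG = """\
Logfile-name /home/user/prog.bpv
Symbol nnodes off
Symbol nodeid off
Symbol GA_Nnodes off
Symbol GA_Nodeid off
"""

logfile, symbols = parse_trace_config(CONFIG)
```

Cheap query routines such as the node-count and node-id calls are called very often and carry little timing information, so switching them off is one way to keep the trace file small.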
Practical issues
• The vampirtrace library and evaluation licenses can be downloaded from http://www.pallas.com/
• Evaluation licenses are limited to 32 processors
• CPU cycle providers are not too keen to provide vampirtrace?

Case Studies - Zeolite Fragments
Fragment and basis size (AO/CD):
Si8O7H18      347/832
Si8O25H18     617/1444
Si26O37H36    1199/2818
Si28O67H30    1687/3928
• DFT Calculations with Coulomb Fitting
  – Basis (Godbout et al.): DZVP - O, Si; DZVP2 - H
  – Fitting Basis: DGAUSS-A1 - O, Si; DGAUSS-A2 - H
• NWChem & GAMESS-UK
  – Both codes use auxiliary fitting basis for coulomb energy, with 3 centre 2 electron integrals held in core.
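
For orientation, the Coulomb fitting mentioned above is usually written in the following textbook form. This is the standard density-fitting formulation, not an equation taken from the slides, and the symbols are introduced here as assumptions: $D_{\mu\nu}$ is the density matrix in the AO basis, $g_u$ are the fitting (CD) functions, $(\mu\nu|u)$ are the 3-centre 2-electron integrals and $(u|v)$ the 2-centre 2-electron integrals.

```latex
\begin{aligned}
\tilde{\rho}(\mathbf{r}) &= \sum_{u} c_u \, g_u(\mathbf{r}), \\
\sum_{v} (u|v)\, c_v &= \sum_{\mu\nu} D_{\mu\nu}\, (u|\mu\nu), \\
J_{\mu\nu} &\approx \sum_{u} (\mu\nu|u)\, c_u .
\end{aligned}
```

In this form the only integrals that grow with both basis sets are the $(\mu\nu|u)$, roughly $N_{\mathrm{AO}}^2 N_{\mathrm{CD}}$ values before symmetry and screening, which is why holding them in core matters for the larger fragments.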

High-End and Commodity Systems
• Cray T3E/1200E
  – 816 processor system at Manchester (CSAR service)
  – 600 MHz EV56 Alpha processor with 256 MB memory
• IBM SP (32 CPU system at DL)
  – 4-way Winterhawk2 SMP "thin nodes" with 2 GB memory
  – 375 MHz Power3-II processors with 8 MB L2 cache
• Compaq AlphaServer SC - 667 (APAC) and 833 MHz CPUs
  – 4-way ES40/667 and /833 SMP nodes with 2 GB memory
  – Alpha 21264a (EV67) CPUs with 8 MB L2 cache
  – Quadrics "fat tree" interconnect (5 usec latency, 150 MB/sec B/W)
• SGI Origin 3800
  – SARA (1000 CPUs) - Numalink with R12k/400 CPUs
• Commodity Systems (DL)
  – 32 X IA32 single processor CPUs (Pentium III/450), fast ethernet
  – Linux Alpha Cluster (16 X UP2000/667 - Quadrics Interconnect)

Measured Parallel Efficiency for NWChem - DFT on IBM-SP; Wall Times to Solution for SCF Convergence
[Figure: speed-up versus number of nodes; both axes run from 32 to 256.]
Zeolite Fragment    Basis AO/CD    Number of Nodes    Wall Time to Solution
Si8O7H18            347/832        64                 238s
Si8O25H18           617/1444       128                364s
Si26O37H36          1199/2818      256                1137s
Si28O67H30          1687/3928      256                2766s
D.A. Dixon et al., HPC, Plenum, 1999, p. 215
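
Speed-up curves of this kind are normally derived from wall times relative to the smallest run. A minimal Python sketch of that arithmetic follows. The timings in the example are hypothetical placeholders, not measurements from the slide, and setting the speed-up of the smallest run equal to its node count is an assumption about how such a plot is normalised.

```python
def speedup_and_efficiency(times):
    """Map node count -> (speed-up, parallel efficiency).

    `times` maps node count to wall time in seconds. The smallest node
    count is the baseline, and its speed-up is defined as its node count.
    """
    base_nodes = min(times)
    base_time = times[base_nodes]
    result = {}
    for nodes in sorted(times):
        speedup = base_nodes * base_time / times[nodes]
        result[nodes] = (speedup, speedup / nodes)
    return result


# Hypothetical wall times in seconds, for illustration only.
hypothetical_times = {32: 1000.0, 64: 500.0, 128: 312.5}
table = speedup_and_efficiency(hypothetical_times)

for nodes, (speedup, efficiency) in table.items():
    print(f"{nodes:4d} nodes: speed-up {speedup:6.1f}, efficiency {efficiency:.2f}")
```

An efficiency below 1.0 at the largest node count means the curve bends away from the diagonal of the speed-up plot.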

DFT Coulomb Fit - NWChem
[Figure: two bar charts of measured time (seconds) versus number of CPUs (16, 32, 64, 128). Panels: Si8O7H18 347/832 and Si8O25H18 617/1444. Series: Pentium Beowulf II, Cray T3E/1200E, Alpha Linux Cluster, AlphaServer SC/667.]
Bar labels recovered from the slide (series assignment not recoverable): 209, 73, 97, 47 on the 0-600 axis; 530, 194, 198, 108 on the 0-1800 axis.

DFT Coulomb Fit - NWChem
[Figure: two bar charts of measured time (seconds) versus number of CPUs (16, 32, 64, 128). Panels: Si28O67H30 1687/3928 and Si26O37H36 1199/2818. Series: Cray T3E/1200E, AlphaServer SC/667, Alpha Linux Cluster.]
Bar labels recovered from the slide (series assignment not recoverable): 10800, 2731, 1382, 844 on the 0-12000 axis; 8309, 1780, 1301, 602, 887, 450 on the 0-8000 axis.
Reference times on 256 nodes of the IBM-SP/P2SC-120: T = 1137 and T = 2766.

NWChem: Si8O7H18 and Si26O37H36
[Figure: for each fragment, a stacked bar chart of the time spent per component in absolute terms and a second chart of the same breakdown as a percentage (0-100%). Horizontal axis ticks: 8, 16, 32, 64 and 16, 32, 64. Components: Diag, ga_demm, DIIS, GetVxc, GtVcoul, FitCD, init g, 3c-2e, S Diag, CDinv, Xcinv.]

NWChem / Si8O25H18 / Cycle

NWChem / Si8O25H18 / Diag

NWChem / Si8O25H18 / subdiag

NWChem / Si8O25H18 / subdiag

Parallel Implementations of GAMESS-UK
• Extensive use of Global Array (GA) Tools and Parallel Linear Algebra from NWChem Project (EMSL)
• SCF and DFT energies and gradients
  – Replicated data, but …
  – GA Tools for caching of I/O for restart and checkpoint files
  – Storage of 3-centre 2-e integrals in DFT Jfit
  – Linear Algebra (via PeIGs, DIIS/MMOs, Inversion of 2c-2e matrix)
• SCF second derivatives
  – Distribution of <vvoo> and <vovo> integrals via GAs
• MP2 gradients
  – Distribution of <vvoo> and <vovo> integrals via GAs

GAMESS-UK: DFT S-VWN
Impact of Coulomb Fitting: Compaq AlphaServer SC/833
Basis: DZV_A2 (Dgauss); A1_DFT Fit
[Figure: two bar charts of measured time (seconds) versus number of CPUs (16, 32, 64, 128), comparing J_EXPLICIT with J_FIT. Panels: Si26O37H36 1199/2818 and Si28O67H30 1687/3928.]
Bar labels recovered from the slide, in extraction order: 7128, 1983, 3785, 1338, 2172, 928, 1365, 751 on the 0-8000 axis; 3583, 1100, 1956, 695, 1113, 484, 710, 386 on the 0-4000 axis.

DFT Coulomb Fit - GAMESS-UK
[Figure: two bar charts of measured time (seconds) versus number of CPUs (16, 32). Panels: Si28O67H30 1687/3928 and Si26O37H36 1199/2818. Series: Pentium Beowulf II, Cray T3E/1200E, IBM SP/WH2-375, SGI Origin3800/R12k-400, Alpha Linux Cluster, AlphaServer SC/667, AlphaServer SC/833.]
Bar labels recovered from the slide (series assignment not recoverable): 2393, 3751, 2041, 1569.

DFT JFit Performance: Si26O37H36
[Figure: three stacked percentage bar charts (0-100%) of the SCF, XC and JFit contributions versus number of CPUs, one each for the Cray T3E/1200E, the AlphaServer SC/833 and the SGI Origin 3000/R12k-400. Horizontal axis ticks: 16, 32, 64 on one chart and 16, 32, 64, 128 on the other two.]
Segment labels recovered from the slide, in extraction order (chart and series assignment not recoverable): 718, 289, 141, 628, 322, 164, 230, 194, 202; 1727, 685, 266, 179, 1674, 834, 458, 250, 277, 259, 293; 436, 214, 102, 75, 295, 153, 80, 43, 303, 266, 243, 213.

GAMESS-UK / Si8O25H18 : 8 CPUs: One DFT Cycle

GAMESS-UK / Si8O25H18 : 8 CPUs: Q†HQ (GAMULT2) and PEIGS

Summary
• PNNL, Daresbury and PALLAS collaborations
• Tool for Performance Analysis - VAMPIR & VAMPIR Trace
  – Extended to handle GA Applications
• Applied in a number of DFT Calculations on Zeolite Fragments on a variety of high-end and commodity-based platforms
• Instrumentation of both NWChem and GAMESS-UK:
  – Distributed data (NWChem)
  – Replicated Data (GAMESS-UK)
  – Analysis of GAs and PeIGs
• Findings
  – non-intrusive
  – Tracing of substantial runs possible
  – Size of trace files in distributed data applications
  – Use in quantifying scaling problems, e.g. GA_MULT2 in GAMESS-UK

Acknowledgements
• Bob Gingold, Australian National University Supercomputer Facility
• Mario Deilmann, Hans Plum, Heinrich Bockhorst, Pallas