TRANSCRIPT
Matthias Müller ([email protected])
Center for Information Services and High Performance Computing (ZIH)
Leistungsanalyse von Rechnersystemen
7. November 2006
Holger Brunst, Matthias Müller: Leistungsanalyse
2
Summary of Previous Lecture (1)
A ten step approach to a systematic performance evaluation:
1. State the goals of the study, define the system
2. List services and outcomes
3. Select metrics
4. List parameters that affect performance
5. Select factors to study
6. Select evaluation techniques
7. Select workload
8. Design sequence of experiments
9. Analyze and interpret data
10. Present results
Summary of previous lecture (2)
Commonly used metrics:
– Clock rate
– MIPS
– MFLOPS
– SPEC metrics
– Response time
– Throughput
– Utilization
– MTBF
– …
Summary of previous lecture (3)
Evaluation techniques
– Analytical Modeling
– Simulation
– Measurement
Summary of previous lecture (4)
Comparison of sequential and parallel algorithms
Speedup: S_n = T_1 / T_n
– n is the number of processors
– T_1 is the execution time of the sequential algorithm
– T_n is the execution time of the parallel algorithm with n processors
Efficiency: E_p = S_p / p
– Its value estimates how well-utilized p processors solve a given problem
– Usually between zero and one. Exception: super-linear speedup (later)
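As a quick sketch of the two definitions above (the timing numbers below are hypothetical):

```python
def speedup(t1, tn):
    """S_n = T_1 / T_n: sequential time divided by parallel time."""
    return t1 / tn

def efficiency(t1, tn, n):
    """E_n = S_n / n: how well the n processors are utilized."""
    return speedup(t1, tn) / n

# Hypothetical measurement: 100 s sequential, 14 s on 8 processors
print(speedup(100, 14))        # ~7.14
print(efficiency(100, 14, 8))  # ~0.89
```

With T_1 = 100 s and T_8 = 14 s the speedup is about 7.14 and the efficiency about 0.89, i.e. the 8 processors are roughly 89% utilized.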
Amdahl’s Law
Find the maximum expected improvement to an overall system when only part of the system is improved
Serial execution time = s + p
Parallel execution time = s + p/n
– Normalizing with respect to serial time (s + p = 1) results in:
• S_n = 1/(s + p/n)
– Drops off rapidly as the serial fraction increases
– Maximum speedup possible = 1/s, independent of n, the number of processors!
Bad news: if an application has only 1% serial work (s = 0.01), you will never see a speedup greater than 100. So why do we build systems with more than 100 processors?
What is wrong with this argument?
S_n = (s + p) / (s + p/n)
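The bound is easy to explore numerically; a minimal sketch using the slide's serial fraction s = 0.01:

```python
def amdahl_speedup(s, n):
    """Amdahl: S_n = 1 / (s + p/n) with s + p = 1 (s = serial fraction)."""
    p = 1.0 - s
    return 1.0 / (s + p / n)

# With 1% serial work the speedup saturates near 1/s = 100:
for n in (10, 100, 1000, 10**6):
    print(n, amdahl_speedup(0.01, n))
```

Even with a million processors the speedup never exceeds 1/s = 100.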
Scaled Speedup (Gustafson-Barsis’ Law)
Amdahl’s speedup equation assumes p is independent of n; in other words, the problem size remains the same
Gustafson-Barsis’ law states that any sufficiently large problem can be efficiently parallelized
More realistic to assume the “runtime” remains the same, NOT the problem size
If the problem size scales up, does the serial part also increase?
Parallel execution time = s + p
Serial execution time = s + n·p
– Normalizing with respect to parallel execution time (s + p = 1) results in:
– S_sn = n + (1 − n)·s = 1 + (n − 1)·p
S_sn = (s + n·p) / (s + p)
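A matching sketch of the scaled-speedup formula, again with the hypothetical serial fraction s = 0.01:

```python
def scaled_speedup(s, n):
    """Gustafson-Barsis: S_sn = n + (1 - n) * s, with s + p = 1 on the parallel run."""
    return n + (1 - n) * s

# With the same 1% serial fraction, scaled speedup stays near-linear:
print(scaled_speedup(0.01, 100))   # 99.01
print(scaled_speedup(0.01, 1000))  # 990.01
```

Unlike Amdahl's fixed-size bound, the scaled speedup keeps growing with n because the parallel part of the problem grows with the machine.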
Workload types, selection and characterization
Types of Workloads
Test workload:
– Any workload used in performance studies
– Real or synthetic
Real workload:
– Observed on a system being used for normal operation
– Cannot be repeated
– May contain sensitive data
Synthetic workload:
– Should be representative for a real workload
– Often smaller in size
Historical examples for test workloads
Addition instruction
Instruction mixes
Kernels
Synthetic programs
Application benchmarks
Popular benchmarks: Eratosthenes sieve algorithm
Algorithm to find prime numbers
Kernel
Simple
An algorithm is always independent of a computer language or specific implementation
Not very representative of today's use of computers
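For illustration, a minimal sketch of the sieve kernel (an illustration of the algorithm, not the historical benchmark source):

```python
def sieve(limit):
    """Sieve of Eratosthenes: return all primes <= limit."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    for i in range(2, int(limit ** 0.5) + 1):
        if is_prime[i]:
            # Cross out every multiple of i, starting at i*i
            for j in range(i * i, limit + 1, i):
                is_prime[j] = False
    return [i for i, prime in enumerate(is_prime) if prime]

print(sieve(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```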
Popular benchmarks: Ackermann’s Function
Ackermann(m, n) :=
– n+1 if m = 0
– Ackermann(m-1, 1) if n = 0
– Ackermann(m-1, Ackermann(m, n-1)) otherwise
Used to assess the efficiency of procedure calls
Ackermann(3,n) requires (512*4**(n-1) - 15*2**(n+3) + 9*n + 37)/3 calls and a stack depth of 2**(n+3) - 4
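A sketch that instruments a straightforward recursive implementation and checks the call-count formula for n = 3:

```python
calls = 0

def ackermann(m, n):
    """Ackermann's function; the work is almost entirely procedure-call overhead."""
    global calls
    calls += 1
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))

n = 3
print(ackermann(3, n))  # 61
# Compare the measured call count with the formula from the slide:
print(calls, (512 * 4 ** (n - 1) - 15 * 2 ** (n + 3) + 9 * n + 37) // 3)  # 2432 2432
```

The measured 2432 calls for Ackermann(3,3) match the formula exactly, and the maximum recursion depth stays within 2**(n+3) - 4 = 60.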
Popular benchmarks: Whetstone
Used at British Central Computer Agency
11 modules
Representative of 949 ALGOL programs
Available in ALGOL, FORTRAN, PL/I and other languages
See Curnow and Wichmann (1975)
Results in KWIPS (Kilo Whetstone Instructions Per Second)
Workloads characteristics:
– Floating point intensive
– Cache friendly
– No I/O
Popular benchmarks: LINPACK
Developed by Jack Dongarra (1983) at ANL (now ICL, UTK)
Solves a dense system of linear equations
Algorithmic definition of the benchmark
Reference implementation available (HPL)
Makes heavy use of BLAS
One fixed dataset: 100x100
Used as the benchmark for the TOP500 list
Many vendors have their own hand-tuned implementation
Popular benchmarks: Dhrystone
Developed in 1984 by Reinhold Weicker at Siemens
Represents systems programming environments
Available in C, Pascal and Ada
Results are in Dhrystone Instructions Per Second (DIPS)
Includes ground rules for building and executing Dhrystone (run rules)
Popular Benchmarks: Lawrence Livermore Loops
24 separate tests
Largely vectorizable
Assembled at LLNL (see McMahon 1986)
Popular Benchmarks: Transaction Processing (TPC-C)
Successor of the Debit-Credit benchmark
TPC-C is an on-line transaction processing (OLTP) benchmark
Results report performance (tpmC) and price/performance ($/tpmC)
The reported system has to be available to the customer (at that price)
Running the benchmark requires a costly setup
SPEC groups and benchmarks
Open Systems Group (desktop systems, high-end workstations and servers)
– CPU (CPU benchmarks)
– JAVA (Java client- and server-side benchmarks)
– MAIL (mail server benchmarks)
– SFS (file server benchmarks)
– WEB (web server benchmarks)
High Performance Group (HPC systems)
– OMP (OpenMP benchmark)
– HPC (HPC application benchmark)
– MPI (MPI application benchmark)
Graphics Performance Group (graphics)
– Apc (graphics application benchmarks)
– Opc (OpenGL performance benchmarks)
Workload Selection
System under Study
Seems to be an easy thing to define
Be aware of different abstraction layers
Example: ISO/OSI reference model for computer networks:
7. Application (mail, FTP)
6. Presentation (data compression, ...)
5. Session (dialogs)
4. Transport (messages)
3. Network (packets)
2. Data link (frames)
1. Physical (bits)
Level of Detail of the workload description
Examples:
– Most frequent request (e.g. Addition)
– Frequency of request type (instruction mix)
– Time-stamped sequence of requests
– Average resource demand (e.g. 20 I/O requests per second)
– Distribution of resource demands (not only the average, but also the probability distribution)
Representativeness
After all, benchmarks are not a merit of their own; they should represent real workloads:
Different characteristics to consider:
– Arrival rate of requests
– Resource demands
– Resource usage profile (sequence and amounts of resources used by an application)
To be representative, a test workload has to follow the user behavior in a timely fashion!
SPEC Benchmarks
Lecture "Leistungsanalyse"
Outline
What is SPEC?
Who is SPEC?
Some SPEC benchmarks:
– SPEC CPU
– SPEC HPC
– SPEC OMP
– SPEC MPI
Summary
Matthias Müller ([email protected])
Center for Information Services and High Performance Computing (ZIH)
What and who is SPEC?
What is SPEC?
The Standard Performance Evaluation Corporation (SPEC) is a non-profit corporation formed to establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers. SPEC develops suites of benchmarks and also reviews and publishes submitted results from our member organizations and other benchmark licensees.
For more details see http://www.spec.org
SPEC Members
SPEC Members:
3DLabs * Acer Inc. * Advanced Micro Devices * Apple Computer, Inc. * ATI Research * Azul Systems, Inc. * BEA Systems * Borland * Bull S.A. * CommuniGate Systems * Dell * EMC * Exanet * Fabric7 Systems, Inc. * Freescale Semiconductor, Inc. * Fujitsu Limited * Fujitsu Siemens * Hewlett-Packard * Hitachi Data Systems * Hitachi Ltd. * IBM * Intel * ION Computer Systems * JBoss * Microsoft * Mirapoint * NEC - Japan * Network Appliance * Novell * NVIDIA * Openwave Systems * Oracle * P.A. Semi * Panasas * PathScale * The Portland Group * S3 Graphics Co., Ltd. * SAP AG * SGI * Sun Microsystems * Super Micro Computer, Inc. * Sybase * Symantec Corporation * Unisys * Verisign * Zeus Technology
SPEC Associates:
California Institute of Technology * Center for Scientific Computing (CSC) * Defence Science and Technology Organisation - Stirling * Dresden University of Technology * Duke University * JAIST * Kyushu University * Leibniz Rechenzentrum - Germany * National University of Singapore * New South Wales Department of Education and Training * Purdue University * Queen's University * Rightmark * Stanford University * Technical University of Darmstadt * Texas A&M University * Tsinghua University * University of Aizu - Japan * University of California - Berkeley * University of Central Florida * University of Illinois - NCSA * University of Maryland * University of Modena * University of Nebraska, Lincoln * University of New Mexico * University of Pavia * University of Stuttgart * University of Texas at Austin * University of Texas at El Paso * University of Tsukuba * University of Waterloo * VA Austin Automation Center
SPEC members in Dresden: Workshop June 2007
SPEC HPG = SPEC High-Performance Group
Founded in 1994
Mission: To establish, maintain, and endorse a suite of benchmarks that are representative of real-world high-performance computing applications.
SPEC/HPG includes members from both industry and academia.
Benchmark products:
– SPEC OMP (OMPM2001, OMPL2001)
– SPEC HPC2002 released at SC 2002
– SPEC MPI (under development)
Currently active SPEC HPG Members
Fujitsu
HP
IBM
Intel
SGI
SUN
UNISYS
University of Purdue
Technische Universität Dresden
HPG (High Performance Group) Benchmark Suites
– Jan 1994: Founding of SPEC HPG
– 1996: HPC96
– June 2001: OMP2001 (OMPM2001)
– June 2002: HPC2002
– Jan 2003: OMPL2001
– 2007: MPI2007
Overview and Positioning
Where is SPEC Relative to Other Benchmarks?
There are many metrics; each one has its purpose:
Raw machine performance: Tflops
Microbenchmarks: Stream
Algorithmic benchmarks: Linpack
Compact Apps/Kernels: NAS benchmarks
Application Suites: SPEC
User-specific applications: Custom benchmarks
Why do we need benchmarks?
Identify problems: measure machine properties
Time evolution: verify that we make progress
Coverage: help the vendors to have representative codes:
– Increase competition by transparency
– Drive future development (see SPEC CPU2000)
Relevance: help the customers to choose the right computer
Comparison of different benchmark classes
[Table: benchmark classes (Micro, Algorithmic, Kernels, SPEC, full Apps) rated from - to ++ on the four criteria "identify problems", "time evolution", "coverage", and "relevance".]
SPEC CPU 2006
From John Henning’s talk at SPEC Workshop
June 2007, Dresden
SPEC CPU2006 History
Released August 2006
Replaces CPU2000 (retired February 2007)
5th CPU benchmark
– SPECmark (later called “CPU89”)
– SPEC92 (later called “CPU92”)
– CPU95
– CPU2000
– CPU2006
Note: these updates are required to stay representative
Question to the audience: What kind of application would you add?
CINT 2006
Benchmark | Lang. | Application Area | Brief Description
400.perlbench C Programming Language Derived from Perl V5.8.7. The workload includes SpamAssassin, MHonArc (an email indexer), and specdiff (SPEC's tool that checks benchmark outputs).
401.bzip2 C Compression Julian Seward's bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O.
403.gcc C C-Compiler Based on gcc Version 3.2, generates code for Opteron.
429.mcf C Combinatorial Optim. Vehicle scheduling. Uses a network simplex algorithm (which is also used in commercial products) to schedule public transport.
445.gobmk C Artificial Intelligence: Go Plays the game of Go, a simply described but deeply complex game.
456.hmmer C Search Gene Sequence Protein sequence analysis using profile hidden Markov models (profileHMMs)
458.sjeng C AI: chess A highly-ranked chess program that also plays several chess variants.
462.libquantum C Physics Quantum Comp. Simulates a quantum computer, running Shor's polynomial-time factorization algorithm.
464.h264ref C Video Compression A reference implementation of H.264/AVC, encodes a video stream using 2 parameter sets. The H.264/AVC standard is expected to replace MPEG-2.
471.omnetpp C++ Discrete Event Simulation Uses the OMNet++ discrete event simulator to model a large Ethernet campus network.
473.astar C++ Path-finding Algorithms Pathfinding library for 2D maps, including the well known A* algorithm.
483.xalancbmk C++ XML Processing A modified version of Xalan-C++, which transforms XML documents to other document types.
CFP 2006 (part I)
Benchmark Lang. Application Area Brief Description
410.bwaves Fortran Fluid Dynamics Computes 3D transonic transient laminar viscous flow.
416.gamess Fortran Quantum Chemistry Implements a wide range of quantum chemical computations. The SPEC workload does self-consistent field calculations using the Restricted Hartree-Fock method, Restricted open-shell Hartree-Fock, and Multi-Configuration Self-Consistent Field.
433.milc C Physics/QCD A gauge field generating program for lattice gauge theory with dynamical quarks.
434.zeusmp Fortran Physics/CFD ZEUS-MP is a computational fluid dynamics code developed at the Laboratory for Computational Astrophysics (NCSA, University of Illinois at Urbana-Champaign) for the simulation of astrophysical phenomena.
435.gromacs C, Fortran Biochemistry Molecular dynamics, i.e. simulates Newtonian equations of motion for hundreds to millions of particles. The test case simulates protein Lysozyme in a solution.
436.cactusADM C, Fortran Physics/General Relativity Solves the Einstein evolution equations using a staggered-leapfrog numerical method.
437.leslie3d Fortran Fluid Dynamics Computational Fluid Dynamics (CFD) using Large-Eddy Simulations with Linear-Eddy Model in 3D. Uses MacCormack Predictor-Corrector time integration.
444.namd C++ Biology/Molecular Dynamics Simulates biomolecular systems. Test case has 92,224 atoms of apolipoprotein A-I.
447.dealII C++ FE Analysis deal.II is a C++ library targeted at adaptive finite elements and error estimation. The test case solves a Helmholtz-type equation with non-constant coefficients.
CFP 2006 (part II)
Benchmark Language Application Area Brief Description
450.soplex C++ Linear Programming, Optimization Solves a linear program using a simplex algorithm and sparse linear algebra. Test cases include railroad planning and military airlift models.
453.povray C++ Image Ray-tracing Image rendering. The test case is a 1280x1024 anti-aliased image of a landscape with some abstract objects with textures using a Perlin noise function.
454.calculix C, F Structural Mechanics Finite element code for 3D structural applications. Uses the SPOOLES solver library.
459.GemsFDTD F Electromagnetics Solves Maxwell equations in 3D using the finite-difference time-domain (FDTD) method.
465.tonto Fortran Quantum Chemistry An open source quantum chemistry package, using an object-oriented design in Fortran 95. The test case places a constraint on a molecular Hartree-Fock wavefunction calculation to better match experimental X-ray diffraction data.
470.lbm C Fluid Dynamics Implements the "Lattice-Boltzmann Method" to simulate incompressible fluids in 3D.
481.wrf C, F Weather Weather modeling from scales of meters to thousands of kilometers. The test case is from a 30 km area over 2 days.
482.sphinx3 C Speech Recognition A widely-known speech recognition system from Carnegie Mellon University.
Code growth
Metrics
Speed
– SPECint_base2006 (Required Base result)
– SPECint2006 (Optional Peak result)
– SPECfp_base2006 (Required Base result)
– SPECfp2006 (Optional Peak result)
Throughput
– SPECint_rate_base2006 (Required Base result)
– SPECint_rate2006 (Optional Peak result)
– SPECfp_rate_base2006 (Required Base result)
– SPECfp_rate2006 (Optional Peak result)
Speed Metric for Single Benchmark
For each benchmark in suite, compute ratio vs. time on a reference system
– A 1997 Sun system with 296 MHz UltraSPARC II
– Similar but not identical to CPU2000 ref machine
Example:
– 400.perlbench on a year-2006 iMac took 948 seconds
– On the reference system, it took 9770 seconds
– SPECratio = 9770/948 = 10.3
– If your workload looks like Perl, you might find that this modern iMac runs around 10x faster than a state-of-the-1997-art workstation.
Overall Speed Metric
To obtain the overall speed metrics: take the geometric mean of the individual SPECratios
Why the geometric mean?
Because this is the best answer to the question:
“Without knowing how much time I will spend in text processing vs. network mapping vs. compiling vs. video compression, please tell me about how much faster this machine will be than the reference system.”
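A sketch of the aggregation step, using made-up SPECratios:

```python
from math import prod

def overall_metric(spec_ratios):
    """Overall SPEC speed metric: geometric mean of the individual SPECratios."""
    return prod(spec_ratios) ** (1.0 / len(spec_ratios))

# Hypothetical SPECratios for four benchmarks:
print(overall_metric([10.3, 8.0, 15.0, 12.0]))  # ~11.04
```

A useful property of the geometric mean of ratios: the final ranking of two machines does not depend on which system is chosen as the reference.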
Motivation for Throughput Metric
Differs from speed
Stove analogy:
– One big flame cooks one big pot holding one hogshead (~250 liters) in one hour
– 6 little flames cook 6 little pots, each holding one firkin (~40 liters), in 15 minutes
– Which is better?
Well, the big flame does ~250 liters/hour; each little flame does only ~40 * 4 = 160 liters/hour
Throughput vs. Speed
Big flame does ~250 liters/hour; each little flame does only ~40 * 4 = 160 liters/hour
Alternatives:
– If I only need to heat up an UNOPENED container holding 1 gallon of soup, supper can be served most quickly if I put it on the big flame
– If I need to heat up one butt of soup (= 2 hogsheads), and if I can open the container, I'd be better off using many small flames
In IT business:
– Processing one image in Photoshop or Gimp vs.
– Rendering the next movie with thousands of pictures
CPU2006 Throughput Metric
Formula: (number of copies run) * (reference time for the benchmark) / (elapsed time in seconds)
Example: a Sun Fire E25K runs 144 copies of 400.perlbench in 1066 seconds: 144 * 9770 / 1066 = 1320
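The formula can be checked directly against the numbers in the example above:

```python
def spec_rate(copies, ref_time, elapsed):
    """SPEC throughput metric: copies * reference time / elapsed time."""
    return copies * ref_time / elapsed

# 144 copies of 400.perlbench (reference time 9770 s) finished in 1066 s:
print(round(spec_rate(144, 9770, 1066)))  # 1320
```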
Summary of Metrics
Two different kinds of metrics:
– speed (single application turnaround)
– rate (throughput)
Run rules make the difference between base and peak:
– Base: conservative optimization, less freedom
– Peak: more aggressive optimization, more freedom
Two benchmark suites (SPECint and SPECfp): 2 * 2 * 2 = 8 different metrics
If you look at the single application results you get 2 * 2 * (12 + 17) = 116 different metrics
Example for Run Rules
Base does not allow feedback-directed optimization (still legal in peak)
An unlimited number of flags may be set in base
– Why? Because flag counting is not worth arguing about.
– For example, is -fast:np27 one flag, two, or three? Prove it.
– What if it's -fast_np27?
– What if it's -fast np27 or -fast -np27?
SPEC CPU2000 Result
SPEC OMP
SPEC OMP
Benchmark suite developed by SPEC HPG
Benchmark suite for performance testing of shared memory processor systems
Uses OpenMP versions of SPEC CPU2000 benchmarks
SPEC OMP mixes integer and FP in one suite
OMPM is focused on 4-way to 16-way systems
OMPL is targeting 32-way and larger systems
SPEC OMP Applications
Code | Application | Language | Lines
ammp | Molecular dynamics | C | 13500
applu | CFD, partial LU | Fortran | 4000
apsi | Air pollution | Fortran | 7500
art | Image recognition / neural networks | C | 1300
fma3d | Crash simulation | Fortran | 60000
gafort | Genetic algorithm | Fortran | 1500
galgel | CFD, Galerkin FE | Fortran | 15300
equake | Earthquake modeling | C | 1500
mgrid | Multigrid solver | Fortran | 500
swim | Shallow water modeling | Fortran | 400
wupwise | Quantum chromodynamics | Fortran | 2200
CPU2000 vs. OMPM2001
Characteristic | CPU2000 | OMPM2001
Max. working set | 200 MB | 1.6 GB
Memory needed | 256 MB | 2 GB
Benchmark runtime | 30 min @ 300 MHz | 5 hrs @ 300 MHz
Language | C, C++, F77, F90 | C, F90, OpenMP
Focus | Single CPU | < 16 CPU system
System type | Cheap desktop | MP workstation
Runtime | 24 hours | 34 hours
Runtime 1 CPU | 24 hours | 140 hours
Run modes | Single and rate | Parallel
Number of benchmarks | 26 | 11
Iterations | Median of 3 or more | Worst of 2, median of 3
Source mods | Not allowed | Allowed
Baseline flags | Max of 4 | Any, same for all
Reference system | 1 CPU @ 300 MHz | 4 CPU @ 350 MHz
CPU2000 vs OMPL2001
Characteristic | CPU2000 | OMPL2001
Max. working set | 200 MB | 6.5 GB
Memory needed | 256 MB | 8 GB
Benchmark runtime | 30 min @ 300 MHz | 9 hrs @ 300 MHz
Language | C, C++, F77, F90 | C, F90, OpenMP
Focus | Single CPU | > 16 CPU system
System type | Cheap desktop | Engineering MP system
Runtime | 24 hours | 75 hours
Runtime 1 CPU | 24 hours | 1000 hours
Run modes | Single and rate | Parallel
Number of benchmarks | 26 | 9
Iterations | Median of 3 or more | 2 or more
Source mods | Not allowed | Allowed
Baseline flags | Max of 4 | Any, same for all
Reference system | 1 CPU @ 300 MHz | 16 CPU @ 300 MHz
Program Memory Footprints
Code | OMPM2001 (MB) | OMPL2001 (MB)
wupwise | 1480 | 5280
swim | 1580 | 6490
mgrid | 450 | 3490
applu | 1510 | 6450
galgel | 370 | -
equake | 860 | 5660
apsi | 1650 | 5030
gafort | 1680 | 1700
fma3d | 1020 | 5210
art | 2760 | 10670
ammp | 160 | -
SPEC OMP Results (January 2006)
141 submitted results for OMPM
39 submitted results for OMPL
Vendor | SGI | SUN | HP | HP
Architecture | O3800 | Fire 15K | Superdome | Superdome
CPU | R12000 | UltraSPARC III | Itanium2 | PA-8700+
Speed (MHz) | 400 | 1200 | 1500 | 875
L1 Inst | 32 KB | 32 KB | 16 KB | 0.75 MB
L1 Data | 32 KB | 64 KB | 16 KB | 1.5 MB
L2 | 8 MB | 8 MB | 256 KB | -
L3 | - | - | 6144 KB | -
SPEC OMPL Results: Applications with scaling to 128
SPEC OMPL Results: Superlinear scaling of applu
SPEC OMPL Results: Applications with scaling to 64
SPEC MPI2007
MPI2007 design goals: a benchmark for distributed memory
An application benchmark suite that measures:
– Type of computer processor
– Number of computer processors
– Communication interconnect
– Memory architecture
– Compilers
– MPI library performance
– File system performance
Identifying candidate applications:
– From SPEC CPU2006
– With a call for candidates
Comparison of Different Benchmarks using MPI
Characteristic | HPCC | NPB | SPEC MPI
Language | C | F77, C | F77, F90, C, C++
#MPI calls in the code | ~600 | ~400 | ~2400
#different MPI calls in the code | ~44 | ~36 | ~59
Code size | 47,200 lines | 28,000 lines | ~530,000 lines
Number of applications | 7 | 8 | 13
Application Fields
– Computational fluid dynamics
– Quantum chromodynamics
– Climate modeling
– Ray tracing
– Molecular Dynamics
– Weather prediction
– Heat transfer
– Hydrodynamics
– Flow Simulation
MPI2007 Development
Participating members:
• AMD, Fujitsu, HP, IBM, Intel
• QLogic (PathScale), SGI, SUN
• University of Dresden, Lawrence Livermore Lab
Release date expected to be July 2007
We are always looking for new members to help develop benchmarks
MPI2007 Benchmark Goals
– Runs on clusters or SMPs
– Validates for correctness and measures performance
– Supports 32-bit or 64-bit OS/ABI
– Consists of applications drawn from national labs and university research centers
– Supports a broad range of MPI implementations and operating systems, including Windows, Linux, and proprietary Unix
– Has a runtime of ~1 hour per benchmark test at 16 ranks using GigE, with a 1 GB memory footprint per rank
– Scales to 128 ranks
– Is extensible to future large and extreme data sets planned to cover larger numbers of ranks
MPI2007 – tested for portability
– Architectures:
• Opteron, Xeon, Itanium2, PA-RISC, Power5, Sparc
– Interconnects:
• Ethernet, InfiniBand, InfiniPath, SGI NUMAlink, and shared memory
– Operating systems:
• Linux (RH FC3, SLES9/10, SuSE 9.3), Windows CCS, HP-UX, Solaris, AIX
– MPI implementations:
• HP-MPI, MPICH, MPICH2, Open MPI, IBM MPI, Intel MPI, MPICH-GM, MVAPICH, Fujitsu MPI, InfiniPath MPI, SGI MPT
– Compilers:
• SUN Studio, Fujitsu, Intel, PathScale, PGI, HP, and IBM compilers
MPI2007 – tested for scalability
– Scalable from 16 to 128 ranks (processes) for the medium data set
– Runtime of 1 hour per benchmark test at 16 ranks using GigE on an unspecified reference cluster
– Memory footprint should be < 1 GB per rank at 16 ranks
– Exhaustively tested for rank counts from 12-15 up to 130-140, plus 160, 180, 200, 225, 256, 512
Overview of the applications
Code | Area | Language | LOC | #MPI calls used | #MPI call sites
104.milc | Lattice QCD | C | 17987 | 18 | 51
107.leslie3d | Combustion | F77, F90 | 10503 | 13 | 43
113.GemsFDTD | Electrodynamic simulation | F90 | 21858 | 16 | 237
115.fds4 | CFD | F90, C | 44524 | 15 | 239
121.pop2 | Geophysical fluid dynamics | F90 | 69203 | 17 | 158
122.tachyon | Ray tracing | C | 15512 | 16 | 17
126.lammps | Molecular dynamics | C++ | 6796 | 25 | 625
127.wrf2 | Weather forecast | F90, C | 163462 | 23 | 132
128.GAPgeofem | Geophysical FEM | F77, C | 30935 | 18 | 58
129.tera_tf | Eulerian hydrodynamics | F90 | 6468 | 13 | 42
130.socorro | Density-functional theory | F90 | 91585 | 20 | 155
132.zeusmp2 | Astrophysical CFD | C, F90 | 44441 | 21 | 639
137.lu | SSOR | F90 | 5671 | 13 | 72
MPI2007 benchmark dynamic message call counts (per MPI function; the per-benchmark column headers are missing in this transcript)
MPI_Allgather 303040 32 512
MPI_Allgatherv 7936
MPI_Allreduce 17700 140832 23628416 1696 2002016 60416 36992 12864 224
MPI_Barrier 62 1088 160 320 8640 32 64 15520 9760 96 32
MPI_Bcast 122 292000 9664 1888 67488 352 1184 1248 288
MPI_Cart_create 32 32
MPI_Comm_create 96
MPI_Comm_dup 32 224
MPI_Comm_free 32
MPI_Comm_split 32 32 32 32
MPI_Gather 8512
MPI_Iprobe
MPI_Irecv 359340 3201600 5.58E+08 196544 6508380 6015144 1991616 5266164 845056 19000
MPI_Irsend
MPI_Isend 359340 5.58E+08 601514 4 845056
MPI_Issend
MPI_Probe
MPI_Recv 3270 35371 9152 10106 360 7600320
MPI_Reduce 64 128 1152 64
MPI_Scan 32
MPI_Send 3201600 3270 35371 205696 6518486 1991976 5266164 7619320
MPI_Send_init 16158
MPI_Sendrecv 1204000
MPI_Ssend
MPI_Start 16158
MPI_Startall 1
MPI_Test
MPI_Testany 522
MPI_Wait 718680 3201600 196544 6508380 1991616 19000
MPI_Waitall 151264 3.19E+08 32 1394816 249888
MPI_Waitany 5266164
MPI2007 Characteristics (32 ranks)
Benchmark | Elapsed Time (s) | %User Time | %MPI Time
104.milc 2142.44 82% 18%
107.leslie3d 3997.10 72% 28%
113.GemsFDTD 1682.58 67% 33%
115.fds4 1926.18 91% 9%
121.pop2 2016.27 64% 36%
122.tachyon 2034.54 99% 1%
126.lammps 1841.00 94% 6%
127.wrf2 3085.30 74% 26%
128.GAPgeofem 653.17 86% 14%
129.tera_tf 1116.59 85% 15%
130.socorro 1203.73 96% 4%
132.zeusmp2 1400.41 83% 17%
137.lu 733.14 94% 6%
Pt2Pt Communication Statistics: 122.tachyon (ray tracing)
Pt2Pt Communication Statistics: 107.leslie3D (combustion)
Pt2Pt Communication Statistics: 113.GemsFDTD (electrodynamics)
Message Length Statistics (Pt2Pt)
[Figure: distribution of point-to-point message lengths for each MPI2007 benchmark, binned by message size.]
Matthias Müller ([email protected])
Center for Information Services and High Performance Computing (ZIH)
Available Results
Available Results
– AMD A2210 Reference Platform (16 cores)
• Gigabit Ethernet
• Single Core AMD Opteron 848, 2.2 GHz
– SGI Altix 4700 (16-128 cores)
• SGI Numalink, SGI MPT 1.15
• Dual-Core Intel Itanium II 9040, 1.6 GHz
– HP Proliant BL460c Blade Cluster Platform 3000 BL (16-256 cores)
• Infiniband DDR, HP-MPI 2.2.5
• Dual-Core Intel Xeon 5160, 3.0 GHz
– QLogic, U. Cambridge Darwin Cluster (32-512 cores)
• Infinipath, QLogic Infinipath MPI library 2.0
• Dual-Core Intel Xeon 5160, 3.0 GHz
– QLogic, AMD Emerald Cluster (32-512 cores)
• Infinipath, QLogic Infinipath MPI library 2.1
• Dual-Core AMD Opteron 290, 2.8 GHz
Scales to 128, works on 512
SPECmpiM Results on U. Cambridge's Darwin Cluster
[Figure: SPECmpiM_base ratios for each of the 13 benchmarks at 32, 64, 128, 256, and 512 ranks/cores.]
Scalability on U. Cambridge’s Darwin Cluster (II)
[Figure: speedup of all 13 benchmarks vs. number of ranks (up to 512), compared against the IDEAL scaling line.]
Scalability on HP Cluster (HP ProLiant)
[Figure: speedup of all 13 benchmarks vs. number of ranks (up to 256), compared against the IDEAL scaling line.]
Summary and Conclusion
SPEC MPI2007 properties:
– Application benchmark with 13 different codes
– Run and reporting rules for reproducibility
– Tested on a wide range of platforms:
• CPU and Node Architectures
• Interconnects
• Compilers
• MPI implementations
– Available dataset (medium) scales to 128 ranks
– Next steps:
• Large dataset with enhanced scalability for larger systems
• …
Use Cases
Use cases
– Performance trends
– Compiler and performance
– Comparing different Itanium systems
– Comparing different system generations
SPEC performance trends (performance per thread)
[Chart: SPEC results per thread of 8-way systems, 1999 to 2005, on a log scale, with an exponential trend fit of y = 3·10^-17 · e^(0.0012x).]
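The doubling time implied by the fitted trend can be read off directly; this assumes the x axis counts days (as spreadsheet date serials do), which is not stated on the chart itself:

```python
import math

# Exponential fit from the chart: y = 3e-17 * exp(0.0012 * x),
# assuming x is measured in days (e.g. spreadsheet date serials).
growth_rate = 0.0012  # per day
doubling_days = math.log(2) / growth_rate
print(f"doubling time: {doubling_days:.0f} days "
      f"(~{doubling_days / 365.25:.1f} years)")
```

Under that assumption, per-thread SPEC performance doubled roughly every 1.6 years over this period.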
Where Does the Performance Go? or: Why Should I Care About the Memory Hierarchy?

[Chart: processor vs. DRAM performance, 1980-2004, log scale. Processor performance grows ~60%/yr (2x per 1.5 years, "Moore's Law"); DRAM latency improves only ~9%/yr (2x per 10 years). The resulting processor-memory performance gap grows by ~50% per year.]
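The two growth rates translate directly into the growth of the gap; a quick check of the roughly 50%-per-year figure, using the rates from the chart:

```python
# Annual growth rates from the chart: processor ~60 %/yr, DRAM ~9 %/yr.
cpu_growth = 1.60
dram_growth = 1.09

# Relative growth of the processor-memory gap per year
gap_growth = cpu_growth / dram_growth
print(f"gap grows ~{(gap_growth - 1) * 100:.0f}% per year")  # ~47%

# Accumulated over e.g. 20 years the gap becomes three orders of magnitude
years = 20
print(f"gap after {years} years: {gap_growth ** years:.0f}x")
```

1.60/1.09 is about 1.47, i.e. just under the 50%/year quoted on the slide, and compounding it over two decades is what makes the memory hierarchy dominate application performance.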
Comparison of OMPM base results with different compilers

[Chart: OMPM2001 base scores per benchmark (310.wupwise_m through 332.ammp_m) for three compiler generations: NEC ?/7.0, NEC 4.1/8.0, and Intel 8.1.]
Influence of compilers on OMPM base 32-way results

[Chart: OMPM2001 base scores of 32-way systems, including Altix (9M), pSeries 690, p5 570, pSeries 690T, AlphaServer, Asama 8.1, Altix, Superdome, Asama 8.0, PRIMEPOWER*, PA-Superdome, Origin, Asama 7.0, and SGI R12K.]
Comparison OMPM on 32-way 1.5 GHz Itanium

[Chart: OMPM2001 base scores per benchmark (310.wupwise_m through 332.ammp_m) on three 32-way 1.5 GHz Itanium systems: Asama, Altix, and Superdome.]
SMP Performance Gain Itanium/Itanium 2

[Chart: per-benchmark performance gain (0x to 3x) of Itanium 2 over Itanium for the OMPM codes (wupwise through ammp_m) plus StarCD, StarCD large, Parapyr, UG, and Uranus.]
The history of NEC SX series

[Figure: NEC SX timeline, 1985-2004. Technology: bipolar, water-cooled (SX-1/2, SX-3) -> CMOS, air-cooled (SX-4) -> single-chip vector processor; the CPU shrinks from a 45.7 cm x 38.6 cm module to a 2 cm x 2 cm chip. Architecture: single-module multi-CPU nodes -> multi-node systems (SX-5) -> large-scale clusters with >100 nodes (SX-6/7) -> massive-scale clusters with >500 nodes (SX-8, 2004).]
Performance Properties of Different SX systems

System   Availability   Mem band/CPU   CPU perf.   Node perf.   Mem band/node
SX-4     1996           16 GB/s        2 GF/s      64 GF/s      512 GB/s
SX-5e    1999           32 GB/s        4 GF/s      64 GF/s      512 GB/s
SX-6     2001           32 GB/s        8 GF/s      64 GF/s      256 GB/s
SX-6+    2002           36 GB/s        9 GF/s      72 GF/s      324 GB/s
SX-8     2004           64 GB/s        16 GF/s     128 GF/s     512 GB/s

CPU performance: factor 2 in two years – memory bandwidth per CPU: factor 2 in eight years
Properties of SPEC codes on vector systems

Name      Lang   Vratio   Vlen     MEM (MB)
Wupwise   F      87.34    58.74    1488
Swim      F      99.75    253.48   1584
Mgrid     F      99.14    211.04   480
Applu     F      81.31    34.17    1520
Galgel    F      92.57    45.14    272
Equake    C      0.06     9.60     464
Apsi      F      76.70    23.02    1648
Gafort    F      40.25    59.60    1680
Fma3d     F      10.29    8.95     1040
Art       C      32.06    242.14   272
Ammp      C      76.67    102.79   176
Expectations
Swim, mgrid, and maybe galgel should perform well
Equake, fma3d, and art should perform poorly
However, the focus was not on absolute performance, but on relative performance and scalability
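These expectations follow from an Amdahl-style bound on vectorization: if a fraction f of the operations (the Vratio from the table) runs on the vector unit at s times scalar speed, the achievable speedup is 1/((1-f) + f/s). A small sketch, where the vector/scalar gain of 20 is an assumed, purely illustrative value:

```python
def vector_speedup(vratio_percent, vector_gain):
    """Amdahl-style bound: fraction f of operations runs on the vector
    unit with speedup vector_gain, the rest stays scalar."""
    f = vratio_percent / 100.0
    return 1.0 / ((1.0 - f) + f / vector_gain)

# Vratio values from the table above; gain of 20 is an assumption.
gain = 20
print(f"swim   (Vratio 99.75): {vector_speedup(99.75, gain):.1f}x")
print(f"applu  (Vratio 81.31): {vector_speedup(81.31, gain):.1f}x")
print(f"equake (Vratio  0.06): {vector_speedup(0.06, gain):.2f}x")
```

Even at Vratio 81% less than a quarter of the vector unit's potential is realized, and at Vratio near zero (equake) the vector unit is irrelevant; this is why only the nearly fully vectorized codes with long vector lengths (swim, mgrid) can run efficiently.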
SPEC efficiency on SX
Measured efficiency on the SX-5:

wupwise    0.63%
swim      67.89%
mgrid     52.46%
applu      6.94%
galgel    13.71%
equake     1.17%
apsi       4.21%
gafort     0.15%
fma3d      1.14%
art        0.92%
ammp       1.88%
Performance measurements
All performance is reported relative to the performance of one thread on SX-4
Number of threads used:
– 1,2,4,8,16,32 on SX-4
– 1,2,4,8,16 on SX-5
– 1,2,4,8 on SX-6+
– 1,2,4,8 on SX-8
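All following plots divide the single-thread SX-4 runtime by the measured runtime; a minimal sketch of that normalization, where t_ref and the measurement list are made-up placeholder values:

```python
# Normalization used in the following plots: performance relative to
# one thread on the SX-4. All runtimes here are illustrative only.
t_ref = 1000.0  # seconds, 1 thread on SX-4 (assumed)

# (threads, runtime in seconds) for one benchmark (assumed values)
measurements = [(1, 980.0), (2, 500.0), (4, 260.0), (8, 140.0)]

for threads, t in measurements:
    rel_perf = t_ref / t  # the y-axis value in the plots below
    print(f"{threads:2d} threads: {rel_perf:.2f}x SX-4 single thread")
```

This makes curves from different machine generations directly comparable on one axis: a point at y = 16 means "sixteen times one SX-4 thread", regardless of which system produced it.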
Wupwise – expected behavior

[Chart: wupwise performance relative to one SX-4 thread vs. number of threads (up to 32) on SX-4, SX-5, SX-6, and SX-8. Note: same node performance of SX-4/5/6.]
Art – improves better than peak performance

[Chart: art performance relative to one SX-4 thread vs. number of threads on SX-4/5/6/8. Art benefits from improvements of the scalar unit.]
Swim – surprisingly improves with every generation

[Chart: swim performance relative to one SX-4 thread vs. number of threads on SX-4/5/6/8. Swim is compute bound on SX-4 and SX-5!]
Mgrid – large improvements from SX-6+ to SX-8

[Chart: mgrid performance relative to one SX-4 thread vs. number of threads on SX-4/5/6/8. Reason: improved stride-2 memory access.]
Ammp – not much improvement from SX-4 to 5 and 6 to 8

[Chart: ammp performance relative to one SX-4 thread vs. number of threads on SX-4/5/6/8.]
Explanation for ammp improvements
Ammp contains a lot of locks
Lock performance (measured with the EPCC microbenchmarks):

System   Lock       Lock ratio   Ammp ratio
SX-6+    34.3 µs    21.2         12.8
SX-8     13.5 µs    13.4         1.2
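The EPCC microbenchmarks obtain such numbers by timing a loop of lock/unlock pairs and subtracting the cost of the empty loop. A rough sketch of the same idea in Python, where threading.Lock stands in for the original OpenMP lock (so the absolute numbers are not comparable to the table):

```python
import threading
import time

# EPCC-style lock overhead estimate: time a loop of lock/unlock pairs
# and subtract the cost of the empty loop. Python sketch only; the
# original benchmark measures OpenMP locks in C/Fortran.
N = 100_000
lock = threading.Lock()

t0 = time.perf_counter()
for _ in range(N):
    pass
empty = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(N):
    lock.acquire()
    lock.release()
locked = time.perf_counter() - t0

overhead_us = (locked - empty) / N * 1e6
print(f"lock/unlock overhead: ~{overhead_us:.2f} microseconds")
```

Subtracting the empty loop isolates the synchronization cost from the loop bookkeeping, which is the essential trick behind all the EPCC overhead measurements.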
General observations
With the exception of equake and galgel, the applications show good scalability
Peak performance improvements
– realized to 87% to 96% for 1 thread
– realized to 81% to 89% for 8 threads
On average an SX-8 CPU is 6.14 times faster than an SX-4 CPU (peak ratio is 8)
No significant difference between scalar and vector codes
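The per-thread and overall figures are consistent with each other, which a quick cross-check shows (the even split across the three generation steps is an assumption for illustration):

```python
# Cross-check: an SX-8 CPU is on average 6.14x faster than an SX-4 CPU,
# against a peak ratio of 8.
overall = 6.14 / 8
print(f"overall fraction of peak realized: {overall:.0%}")  # ~77%

# Spread evenly over the three generation steps
# (SX-4 -> SX-5 -> SX-6 -> SX-8, each a peak doubling):
per_step = overall ** (1 / 3)
print(f"per generation step: {per_step:.0%}")  # ~92%
```

About 92% of each peak doubling realized per generation sits inside the 87-96% range quoted above for single-thread runs.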
Summary
Summary – What you should have learned
– There are many different benchmark approaches: microbenchmarks, kernels, applications, …
– SPEC benchmarks are application or at least application-oriented benchmarks, designed to represent current workloads
• An update is required after a few years
– SPEC benchmarks are used to:
• Measure and compare performance of systems
• Drive future development
• …
– Different metrics are used (base/peak, speed/throughput)
– Many different factors have an influence on application performance:
• CPU
• Memory system
• Compilers
• OS and runtime environment
• I/O system
• …