What Makes HPC Applications Challenging? MITRE / ISI / MIT Lincoln Laboratory Benchmarking Working Group
TRANSCRIPT
Slide-1: Benchmarking Working Group Session Agenda
1:00-1:15 David Koester What Makes HPC Applications Challenging?
1:15-1:30 Piotr Luszczek HPCchallenge Challenges
1:30-1:45 Fred Tracy Algorithm Comparisons of Application Benchmarks
1:45-2:00 Henry Newman I/O Challenges
2:00-2:15 Phil Colella The Seven Dwarfs
2:15-2:30 Glenn Luecke Run-Time Error Detection Benchmark
2:30-3:00 Break
3:00-3:15 Bill Mann SSCA #1 Draft Specification
3:15-3:30 Theresa Meuse SSCA #6 Draft Specification
3:30-?? Discussions
– User Needs
– HPCS Vendor Needs for the MS4 Review
– HPCS Vendor Needs for the MS5 Review
– HPCS Productivity Team Working Groups
Slide-2: What Makes HPC Applications Challenging?
David Koester, Ph.D.
MITRE / ISI / MIT Lincoln Laboratory
HPCS Productivity Team Meeting, 11-13 January 2005
Marina del Rey, CA
This work is sponsored by the Department of Defense under Army Contract W15P7T-05-C-D001. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Slide-3: Outline
• HPCS Benchmark Spectrum
• What Makes HPC Applications Challenging?
  – Memory access patterns/locality
  – Processor characteristics
  – Concurrency
  – I/O characteristics
  – What new challenges will arise from Petascale/s+ applications?
• Bottleneckology
  – Amdahl's Law
  – Example: Random Stride Memory Access
• Summary
Slide-4: HPCS Benchmark Spectrum
[Figure: HPCS Benchmark Spectrum. The spectrum runs from HPCchallenge micro & kernel benchmarks, through the HPCS spanning set of kernels (discrete math, graph analysis, linear solvers, signal processing, simulation, I/O) and the Scalable Synthetic Compact Applications, to Mission Partner application benchmarks spanning existing, emerging, and future applications in mission areas such as knowledge formation, simulation, intelligence, and reconnaissance. HPCchallenge supplies local (DGEMM, STREAM, RandomAccess, 1D FFT) and global (Linpack, PTRANS, RandomAccess, 1D FFT) execution performance bounds. Each Scalable Synthetic Compact Application pairs a data generator with several kernels: 1. Optimal Pattern Matching, 2. Graph Analysis, 3. Simulation (NWCHEM), 4. Simulation (NAS PB AU), 5. Simulation (Multi-Physics), 6. Signal Processing; these serve as execution and development performance indicators. Mission Partner application benchmarks provide system bounds: current codes (UM2000, GAMESS, OVERFLOW, LBMHD, RFCTH, HYCOM) and near-future codes (NWChem, ALEGRA, CCSM).]
Slide-5: What Makes HPC Applications Challenging?
[Figure: HPCS Benchmark Spectrum, repeated from Slide-4 as a backdrop.]
• Full applications may be challenging due to
  – Killer Kernels
  – Global data layouts
  – Input/Output
• Killer Kernels are challenging because of characteristics that link directly to the architecture
• Identify bottlenecks by mapping applications to architectures
Slide-6: What Makes HPC Applications Challenging?
• Memory access patterns/locality [Killer Kernels, Global Data Layouts]
  – Spatial and temporal locality
  – Indirect addressing (see the sketch after this list)
  – Data dependencies
• Processor characteristics [Killer Kernels]
  – Processor throughput (instructions per cycle): low arithmetic density; floating point versus integer
  – Special features: GF(2) math, popcount, integer division
• Concurrency [Killer Kernels, Global Data Layouts]
  – Ubiquitous for Petascale/s
  – Load balance
• I/O characteristics [Input/Output]
  – Bandwidth
  – Latency
  – File access patterns
  – File generation rates
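The indirect-addressing point above is the single biggest locality killer, and it is easiest to see in code. Below is a minimal C sketch (my own illustration, not from the slides; array names and sizes are arbitrary) contrasting a stride-1 sweep with an index-driven gather of the kind found in sparse and table-lookup kernels. Both loops do the same arithmetic, but the gather's address stream is determined by data, so hardware prefetching and cache reuse largely stop helping.

```c
#include <stddef.h>

/* Stride-1 sweep: consecutive addresses, prefetch-friendly. */
double sum_stride1(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Indirect (gather) sweep: idx[] determines the access pattern.
 * If idx is a random permutation, nearly every load misses in
 * cache (or lands in remote memory on a distributed machine),
 * even though the flop count is identical to sum_stride1.      */
double sum_indirect(const double *a, const size_t *idx, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[idx[i]];
    return s;
}
```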
Slide-7: Cray "Parallel Performance Killer" Kernels
Kernel: Performance Characteristic
– RandomAccess: high demand on remote memory; no locality
– 3D FFT: non-unit strides; high bandwidth demand
– Sparse matrix-vector multiply: irregular, unpredictable locality
– Adaptive mesh refinement: dynamic data distribution; dynamic parallelism
– Multi-frontal method: multiple levels of parallelism
– Sparse incomplete factorization: Amdahl's Law bottlenecks
– Preconditioned domain decomposition: frequent large messages
– Triangular solver: frequent small messages; poor ratio of computation to communication
– Branch-and-bound algorithm: frequent broadcast synchronization
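To make the first row of the table concrete, here is a simplified serial sketch in the spirit of the HPCchallenge RandomAccess (GUPS) update loop; the generator constant and masking scheme follow the commonly published form, but treat the details as illustrative rather than as the official benchmark code. Each update is a read-modify-write at an effectively random table location, which is exactly the "high demand on remote memory, no locality" behavior listed above.

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified RandomAccess-style update loop (serial sketch).
 * table_size is assumed to be a power of two, so masking with
 * (table_size - 1) maps the pseudo-random value to an index.   */
void random_updates(uint64_t *table, size_t table_size, size_t n_updates)
{
    uint64_t ran = 1;
    for (size_t i = 0; i < n_updates; i++) {
        /* 64-bit shift-register generator as used in GUPS-like codes. */
        ran = (ran << 1) ^ (((int64_t)ran < 0) ? 0x7ULL : 0ULL);

        /* Read-modify-write at an essentially random location:
         * no spatial or temporal locality, so caches and
         * prefetchers provide little benefit.                   */
        table[ran & (table_size - 1)] ^= ran;
    }
}
```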
Slide-8: Killer Kernels, Phil Colella: The Seven Dwarfs (Computational Research Division)
Algorithms that consume the bulk of the cycles of current high-end systems in DOE:
• Structured Grids (including locally structured grids, e.g., AMR)
• Unstructured Grids
• Fast Fourier Transform
• Dense Linear Algebra
• Sparse Linear Algebra
• Particles
• Monte Carlo
(Should also include optimization / solution of nonlinear systems, which at the high end is something one uses mainly in conjunction with the other seven.)
Slide-9: Mission Partner Applications (Memory Access Patterns/Locality)
• How do mission partner applications relate to the HPCS spatial/temporal view of memory?
  – Kernels?
  – Full applications?
[Figure: Spatial versus temporal locality. HPCchallenge benchmarks (STREAM, HPL, FFT, PTRANS, RandomAccess) appear as HPCS challenge points spanning the space from low to high spatial locality (x axis) and low to high temporal locality (y axis); mission partner applications and kernels such as AVUS and NAS CG class C fall within the region these points bound.]
Slide-10: Processor Characteristics: Special Features
• Comparison of similar-speed MIPS processors with and without
  – GF(2) math
  – Popcount
• Similar or better performance reported using Alpha processors (Jack Collins, NCIFCRF)
• Codes
  – Cray-supplied library
  – The Portable Cray Bioinformatics Library by ARSC
• References
  – http://www.cray.com/downloads/biolib.pdf
  – http://cbl.sourceforge.net/
Algorithmic speedup of 120x
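To illustrate why popcount and GF(2) support matter for the bioinformatics codes above: the inner operation is typically a population count over packed bit-vectors, for example counting the positions at which two binary-encoded sequences differ. The C sketch below (a generic illustration, not taken from the Cray or ARSC libraries) uses the GCC/Clang builtin, which compiles to a single instruction on processors with hardware popcount and to a much slower software sequence otherwise; that hardware/software gap is the kind of difference the MIPS comparison highlights.

```c
#include <stdint.h>
#include <stddef.h>

/* Count positions at which two packed bit-vectors differ.
 * XOR is addition in GF(2); popcount tallies the resulting 1 bits. */
size_t hamming_distance(const uint64_t *a, const uint64_t *b, size_t nwords)
{
    size_t dist = 0;
    for (size_t i = 0; i < nwords; i++) {
        uint64_t x = a[i] ^ b[i];
#if defined(__GNUC__) || defined(__clang__)
        dist += (size_t)__builtin_popcountll(x);   /* one instruction with HW popcount */
#else
        while (x) {                                /* portable software fallback */
            x &= x - 1;                            /* clear lowest set bit */
            dist++;
        }
#endif
    }
    return dist;
}
```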
Slide-11: Concurrency
Insert Cluttered VAMPIR Plot here
Slide-12: I/O Relative Data Latency‡
Note: 11 orders of magnitude relative differences!
[Figure: Relative data latency across the memory/storage hierarchy, plotted on a log scale spanning 1.0E+00 to 1.0E+11 (latency differences).]
‡Henry Newman (Instrumental)
Slide-13: I/O Relative Data Bandwidth per CPU‡
[Figure: Relative data bandwidth per CPU, plotted on a log scale spanning 1.0E-02 to 1.0E+03 (times difference), for CPU registers, L1 cache, L2 cache, memory, disk, NAS, and tape.]
Note: 5 orders of magnitude relative differences!
‡Henry Newman (Instrumental)
Slide-14: Strawman HPCS I/O Goals/Challenges
• 1 trillion files in a single file system
  – 32K file creates per second
• 10K metadata operations per second
  – Needed for checkpoint/restart files
• Streaming I/O at 30 GB/sec full duplex
  – Needed for data capture
• Support for 30K nodes
  – Future file systems need low-latency communication
An envelope on HPCS Mission Partner requirements
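As a quick scale check (my own back-of-the-envelope arithmetic, not from the slides), the first two goals are related: even at a sustained 32K file creates per second, populating a trillion-file file system takes roughly a year of continuous creates, so the file count and the create rate really are separate requirements.

```c
#include <stdio.h>

int main(void)
{
    const double total_files   = 1.0e12;  /* goal: 1 trillion files            */
    const double creates_per_s = 32.0e3;  /* goal: 32K file creates per second */

    double seconds = total_files / creates_per_s;   /* ~3.1e7 s  */
    double days    = seconds / 86400.0;             /* ~360 days */

    printf("Filling the file system at that rate: %.3g s (about %.0f days)\n",
           seconds, days);
    return 0;
}
```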
Slide-15: HPCS Benchmark Spectrum: Future and Emerging Applications
[Figure: HPCS Benchmark Spectrum, repeated from Slide-4 as a backdrop.]
• Identifying HPCS Mission Partner efforts
  – 10-20K processor — 10-100 Teraflop/s scale applications
  – 20-120K processor — 100-300 Teraflop/s scale applications
  – Petascale/s applications
  – Applications beyond Petascale/s
• LACSI Workshop — The Path to Extreme Supercomputing
  – 12 October 2004
  – http://www.zettaflops.org
• What new challenges will arise from Petascale/s+ applications?
Slide-16: Outline
• HPCS Benchmark Spectrum
• What Makes HPC Applications Challenging?
  – Memory access patterns/locality
  – Processor characteristics
  – Parallelism
  – I/O characteristics
  – What new challenges will arise from Petascale/s+ applications?
• Bottleneckology
  – Amdahl's Law
  – Example: Random Stride Memory Access
• Summary
Slide-17: Bottleneckology
• Bottleneckology
  – Where is performance lost when an application is run on an architecture?
  – When does it make sense to invest in architecture to improve application performance?
  – System analysis driven by an extended Amdahl's Law
Amdahl's Law is not just about parallel and sequential parts of applications!
• References:
  – Jack Worlton, "Project Bottleneck: A Proposed Toolkit for Evaluating Newly-Announced High Performance Computers", Worlton and Associates, Los Alamos, NM, Technical Report No. 13, January 1988
  – Montek Singh, "Lecture Notes — Computer Architecture and Implementation: COMP 206", Dept. of Computer Science, Univ. of North Carolina at Chapel Hill, Aug 30, 2004, www.cs.unc.edu/~montek/teaching/fall-04/lectures/lecture-2.ppt
Slide-18: Lecture Notes — Computer Architecture and Implementation (5)‡
‡Montek Singh (UNC)
Slide-19: Lecture Notes — Computer Architecture and Implementation (6)‡
‡Montek Singh (UNC)
Slide-20: Lecture Notes — Computer Architecture and Implementation (7)‡
Also works for Rate = Bandwidth!
‡Montek Singh (UNC)
Slide-21: Lecture Notes — Computer Architecture and Implementation (8)‡
‡Montek Singh (UNC)
Slide-22: Bottleneck Example (1)
• Combine stride-1 and random stride memory access
  – 25% random stride access
  – 33% random stride access
• Memory bandwidth performance is dominated by the random stride memory access
SDSC MAPS on an IBM SP-3
Slide-23: Bottleneck Example (2)
• Combine stride-1 and random stride memory access
  – 25% random stride access
  – 33% random stride access
• Memory bandwidth performance is dominated by the random stride memory access
SDSC MAPS on a COMPAQ Alphaserver
Amdahl’s Law [ 7000 / (7*0.25 + 0.75) ] = 2800 MB/s
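The 2800 MB/s figure is Amdahl's Law written in rate (bandwidth) form: effective bandwidth is the harmonic mean of the component bandwidths, weighted by the fraction of accesses each one serves. The short C sketch below reproduces the slide's number; the 7000 MB/s stride-1 and roughly 1000 MB/s random-stride values are read off the MAPS curves, so treat them as illustrative inputs rather than exact measurements.

```c
#include <stdio.h>

/* Amdahl's Law in rate form: effective bandwidth when a fraction
 * f_slow of accesses runs at bw_slow and the rest at bw_fast.    */
static double effective_bw(double f_slow, double bw_slow, double bw_fast)
{
    return 1.0 / (f_slow / bw_slow + (1.0 - f_slow) / bw_fast);
}

int main(void)
{
    const double bw_fast = 7000.0;  /* MB/s, stride-1 (approx., from MAPS)      */
    const double bw_slow = 1000.0;  /* MB/s, random stride (approx., from MAPS) */

    /* 25% random: 1 / (0.25/1000 + 0.75/7000) = 2800 MB/s,
     * the same value as the slide's 7000 / (7*0.25 + 0.75).      */
    printf("25%% random stride: %.0f MB/s\n", effective_bw(0.25, bw_slow, bw_fast));
    printf("33%% random stride: %.0f MB/s\n", effective_bw(0.33, bw_slow, bw_fast));
    return 0;
}
```

Even a 25% minority of random-stride accesses cuts deliverable bandwidth by more than half, which is the point the next slide's bullets make.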
Slide-24: Bottleneck Example (2), continued
• Some HPCS Mission Partner applications have
  – Extensive random stride memory access
  – Some random stride memory access
• However, even a small amount of random memory access can cause significant bottlenecks!
Slide-25: Outline
• HPCS Benchmark Spectrum
• What Makes HPC Applications Challenging?
  – Memory access patterns/locality
  – Processor characteristics
  – Parallelism
  – I/O characteristics
  – What new challenges will arise from Petascale/s+ applications?
• Bottleneckology
  – Amdahl's Law
  – Example: Random Stride Memory Access
• Summary
Slide-26: Summary (1)
• Memory access patterns/locality
  – Spatial and temporal locality
  – Indirect addressing
  – Data dependencies
• Processor characteristics
  – Processor throughput (instructions per cycle): low arithmetic density; floating point versus integer
  – Special features: GF(2) math, popcount, integer division
• Parallelism
  – Ubiquitous for Petascale/s
  – Load balance
• I/O characteristics
  – Bandwidth
  – Latency
  – File access patterns
  – File generation rates
What makes applications challenging!
• Expand this list as required
• Work toward consensus with
  – HPCS Mission Partners
  – HPCS Vendors
• Understand bottlenecks
• Characterize applications
• Characterize architectures
Slide-27: HPCS Benchmark Spectrum
[Figure: HPCS Benchmark Spectrum, repeated from Slide-4 as a backdrop.]
What Makes HPC Applications Challenging?
• Full applications may be challenging due to
  – Killer Kernels
  – Global data layouts
  – Input/Output
• Killer Kernels are challenging because of characteristics that link directly to the architecture
• Identify bottlenecks by mapping applications to architectures
Impress upon the HPCS community the need to identify what makes the application challenging when using an existing Mission Partner application for a systems analysis in the MS4 review.