Taming High Performance Computing with Compiler Technology
TRANSCRIPT
John Mellor-Crummey
Department of Computer Science
Center for High Performance Software Research
Taming High Performance Computing with Compiler Technology
www.cs.rice.edu/~johnmc/presentations/rice-4-04.pdf
High Performance Computing Applications
• Scientific inquiry ranging from elementary particles to cosmology
• Pollution modeling and remediation planning
• Storm forecasting and climate prediction
• Advanced vehicle design
• Computational chemistry and drug design
• Molecular nanotechnology
• Cryptology
• Nuclear weapons stewardship
High Performance Applications
[Diagram: the interplay of Algorithms, Data Structures, and Architectures.]
• Effective parallelizations ↔ scalability
• Single-processor performance can differ by integer factors
Status of Highly-parallel Systems
“[Scalable, highly-parallel, microprocessor-based systems] remain in the research and experimental stage primarily because we lack adequate software technology, application-development tools, and, ultimately, well-developed applications.”
— “Information Technology Research: Investing in our Future,” PITAC Report to the President, 1999
Challenges for Highly Parallel Computing
• Effective algorithms for complex problems
• Programming models and compilers
• Application development tools
• Operating systems for large-scale machines
• Better designs for high-performance architectures
Current Research Themes
• Compiler support for data parallel programming
—Implicitly and explicitly parallel global address space languages
• Technology for auto-tuning software
—Automatically tailor code to a microprocessor architecture
• Performance analysis tools
—Understanding application behavior on current systems
• Performance modeling
—How will applications perform at different scales and on future systems?
• Compiler technology for scientific scripting languages
—R language for statistical programming
Outline
• Motivation
• Compiler technology for HPC
☛ Compiling data-parallel languages
— Semi-automatic synthesis of performance models
• Challenges for the future
• Other work
Compiling data-parallel languages
• Introduction
— Data parallelism
— Compiling HPF-like languages
• Rice dHPF compiler
— Data partitioning research
— Analysis and code generation
• Experimental results
Data Parallelism
• Apply the same operation to many data elements
—need not be synchronous
—need not be completely uniform
• Applicable to many problems in science and engineering
Data Parallel Programming Alternatives
• Hand-coded parallelizations using library-based models
—complete applicability
—difficult to design and implement
—all responsibility for tuning falls to the developer
• Application frameworks
—easy to use
—limited applicability
• Single-threaded data-parallel languages
—much more flexible than application frameworks
—much simpler to use than hand-coded parallelizations
—compilers significantly determine performance
– offload details of tuning from the developer
– compilers are enormously complex
– out of luck if the compiler doesn’t deliver performance
Data Parallel Compilation
[Diagram: a High Performance Fortran program (Fortran program + data partitioning) is compiled (partition computation, insert communication, manage storage) to run on a parallel machine, computing the same answers as the sequential program.]
Partitioning of data drives partitioning of computation, communication, and synchronization
Example HPF Program

CHPF$ processors P(3,3)
CHPF$ distribute A(block, block) onto P
CHPF$ distribute B(block, block) onto P
      DO i = 2, n - 1
        DO j = 2, n - 1
          A(i,j) = .25 * (B(i-1,j) + B(i+1,j) + B(i,j-1) + B(i,j+1))

[Figure: (BLOCK, BLOCK) distribution of the data for A and B over processors P(0,0)…P(2,2).]
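The (BLOCK, BLOCK) distribution above can be sketched as an ownership function. This is an illustrative sketch, not dHPF code; the `block_owner` helper and the ceiling-block-size convention are assumptions for the example.

```python
# Sketch (assumed helper): which processor owns A(i, j) under a
# (BLOCK, BLOCK) distribution onto a 3x3 grid, using 1-based indices.
def block_owner(i, j, n, p_rows=3, p_cols=3):
    br = -(-n // p_rows)            # block size per processor row (ceiling)
    bc = -(-n // p_cols)            # block size per processor column
    return ((i - 1) // br, (j - 1) // bc)   # coordinates in P

# With n = 9, each processor owns a 3x3 tile:
# A(1,1) -> P(0,0), A(9,9) -> P(2,2), A(5,9) -> P(1,2)
```

A stencil reference such as B(i-1, j) then touches a neighboring processor exactly when i-1 falls in the previous block row, which is what drives the compiler's communication analysis.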
Compiling HPF-like Languages
• Partition data
• Select mapping of computation to processors
• Analyze communication requirements
• Partition computation by reducing loop bounds
• Insert communication
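The step "partition computation by reducing loop bounds" can be sketched for a 1-D BLOCK distribution. This is a simplified owner-computes illustration; the `local_bounds` helper is a hypothetical name, not dHPF's API.

```python
# Sketch: shrink a loop's global bounds to the iterations a processor
# owns under a 1-D BLOCK distribution over n elements (1-based).
def local_bounds(lo, hi, myid, nprocs, n):
    b = -(-n // nprocs)                       # block size (ceiling division)
    own_lo, own_hi = myid * b + 1, min((myid + 1) * b, n)
    return max(lo, own_lo), min(hi, own_hi)   # empty range if lo > hi

# Global loop DO i = 2, n-1 with n = 100 on 4 processors:
# processor 0 iterates 2..25, processor 3 iterates 76..99.
```

Each processor then executes only its reduced range, and communication is inserted for the off-processor data those iterations reference.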
The Devil is in the Details …
• Good data and computation partitionings are a must
—without good partitionings, parallelism suffers!
• Excess communication undermines scalability
—both frequency and volume must be right!
• Single processor efficiency is critical
—must use caches effectively
—node code must be amenable to optimization
Goal: compiler and runtime techniques that enable simple and natural programming, yet deliver the performance of hand-coded parallelizations
Rice dHPF Compiler
Achievements
• parallelize sequential codes with minimal rewriting
• near hand-coded performance for tightly coupled codes
Innovations
• Sophisticated data partitionings
• Abstract set-based framework for communication analysis and code generation
• Sophisticated computation partitionings
—partial replication to reduce communication
• Comprehensive optimizations
Data Partitioning
• Good parallel performance requires suitable partitioning
• Tightly-coupled computations are problematic
• Line-sweep computations: e.g., ADI integration

      do j = 1, n
        do i = 2, n
          a(i,j) = … a(i-1,j) …

recurrences make parallelization difficult with BLOCK partitionings
Coarse-Grain Pipelining
[Figure: a block-partitioned domain assigned to Processors 0–3, swept in pipelined order.]
• Compute along partitioned dimensions with block partitioning
• Partial serialization induces wavefront parallelism
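The wavefront can be sketched with a simple step model. This is an illustrative timing sketch under an assumed unit-cost-per-strip model, not a claim about the compiler's actual schedule.

```python
# Sketch: with a recurrence along the partitioned dimension, processor p
# can start strip k of the sweep only after processor p-1 finishes strip k,
# so strip k on processor p executes at wavefront step p + k.
def pipeline_steps(nprocs, nstrips):
    start = {(p, k): p + k for p in range(nprocs) for k in range(nstrips)}
    makespan = max(start.values()) + 1   # steps until the whole sweep completes
    return start, makespan
```

With 4 processors and 8 strips the sweep takes 4 + 8 - 1 = 11 steps instead of the 8 a fully parallel sweep would need: that pipeline fill/drain cost is the partial serialization the slide refers to.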
Parallelizing Line Sweeps
[Figure: performance comparison; one brace marks the compiler-generated coarse-grain pipelining variants, the other the hand-coded multipartitioning.]
Diagonal Multipartitioning
[Figure: a tile grid distributed diagonally among Processors 0–3.]
• Each processor owns 1 tile between each pair of cuts along each distributed dimension
• Enables full parallelism for a sweep along any partitioned dimension
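The 2-D case can be sketched with the classic diagonal tile-ownership function. The `tile_owner` helper is an illustrative assumption that captures the invariant stated above.

```python
# Sketch of a diagonal multipartitioning for a 2-D p x p tile grid:
# tile (r, c) is owned by processor (c - r) mod p, so every row and
# every column of tiles contains each processor exactly once.
def tile_owner(r, c, p):
    return (c - r) % p

p = 4
rows = [{tile_owner(r, c, p) for c in range(p)} for r in range(p)]
cols = [{tile_owner(r, c, p) for r in range(p)} for c in range(p)]
# every set equals {0, 1, 2, 3}: a sweep along either dimension keeps
# all processors busy at every wavefront step
```

This is why a line sweep along any partitioned dimension achieves full parallelism: at each step of the sweep, the active row (or column) of tiles touches all p processors.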
Generalized Multipartitioning
Given an n-dimensional data domain and p processors, select
—which λ dimensions to partition, 2 ≤ λ ≤ n; how many cuts in each
• Partitioning constraints
—# tiles in each (λ − 1)-dimensional hyperplane is a multiple of p
—no more cuts than necessary
• Mapping constraints
—load balance: in a hyperplane, each processor has the same # tiles
—neighbor: in any particular direction, the neighbor of a given processor is the same
• Objective function: minimize communication volume
—pick the configuration of cuts to minimize total cross section
IPDPS 2002 Best Paper in Algorithms; JPDC 2003
Choosing the Best Partitioning
• Enumerate all elementary partitionings
—candidates depend on factorization of p
• Evaluate their communication cost
• Select the minimum cost partitioning
complexity: o( d(d−1)/2 · (log p)^(log log p + 1) )
—d(d−1)/2: number of choices for picking a pair of dimensions to partition with a number of cuts divisible by a particular prime factor
—(log p)^(log log p): possible unique factors of p
—worst case: p is a product of unique prime factors
—very fast in practice
Mapping Tiles with Modular Mappings
[Figure: a basic tile shape is replicated an integral number of times in each dimension, with a modular shift applied between copies; the tiles labeled 0 mark one processor's tiles.]
Formal Compilation Framework
3 types of sets: Data, Iterations, Processors
3 types of mappings
—Layout: data → processors
—Reference: iterations → data
—CompPart: processors → iterations
• Representation
—integer tuples with Presburger arithmetic for constraints
• Analysis: Use set equations to compute set(s) of interest
— iterations allocated to a processor
—communication sets
• Code generation: Synthesize loops from set(s), e.g.
—parallel (SPMD) loop nests
—message packing and unpacking
[Adve & Mellor-Crummey, PLDI98]
Why Symbolic Sets?

      processors P(3,3)
      distribute A(block, block) onto P
      distribute B(block, block) onto P
      DO i = 2, n - 1
        DO j = 2, n - 1
          A(i, j) = .25 * ( B(i-1, j) + B(i+1, j) + B(i, j-1) + B(i, j+1) )

[Figure: data / loop partitioning over the 3×3 processor grid P(0,0)…P(2,2), each processor holding a 20×30 block; for P(x,y), the local section (and iterations executed), the non-local data accessed, and the iterations that access non-local data are marked.]

Local section for P(x,y) (and iterations executed):
{ [i, j] : 20x + 2 ≤ i ≤ 20x + 19 & 30y + 2 ≤ j ≤ 30y + 29 }
Integer-Set Framework: Example

      real A(100)
      distribute A(BLOCK) on P(4)
      do i = 1, N                        ! symbolic N
        ... = A(i-1) + A(i-2) + ...      ! ON_HOME A(i-1)
      enddo

Layout := { [pid] -> [i] : 25*pid + 1 ≤ i ≤ 25*pid + 25 }
Loop := { [i] : 1 ≤ i ≤ N }
CPSubscript := { [i] -> [i-1] }
RefSubscript := { [i] -> [i-2] }

CompPart := (Layout o CPSubscript⁻¹) ∩ Loop
DataAccessed := CompPart o RefSubscript
NonLocalDataAccessed := DataAccessed - Layout
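The set equations above can be played out with explicit finite sets. This is an illustrative sketch for a concrete N; dHPF itself manipulates these sets symbolically with Presburger arithmetic rather than by enumeration.

```python
# Sketch of the slide's set computations with explicit finite sets,
# instantiated for N = 100 and 4 processors with block size 25.
N, B = 100, 25

# Layout: processor pid owns elements B*pid+1 .. B*pid+B (1-based)
layout = {(pid, i) for pid in range(4)
          for i in range(B * pid + 1, B * pid + B + 1)}
loop = set(range(1, N + 1))

# CompPart = (Layout o CPSubscript^-1) ∩ Loop:
# pid executes iteration i when it owns A(i-1)
comp_part = {(pid, d + 1) for (pid, d) in layout if d + 1 in loop}

# DataAccessed = CompPart o RefSubscript: those iterations touch A(i-2)
data_accessed = {(pid, i - 2) for (pid, i) in comp_part}

# Non-local data = DataAccessed - Layout; for processor 1 this is just A(25)
non_local_p1 = {i for (pid, i) in data_accessed - layout if pid == 1}
```

Processor 1 owns A(26..50), executes i = 27..51, and touches A(25..49) through the A(i-2) reference, so the only off-processor element it must fetch is A(25) from processor 0.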
Optimizations using Integer Sets
• Partially replicate computation to reduce communication
—66% lower message volume, 38% faster: NAS BT @ 64 procs
• Coalesce communication sets for multiple references
—41% lower message volume, 35% faster: NAS SP @ 64 procs
• Split loops into “local-only” and “off-processor” loops
—10% fewer Dcache misses, 9% faster: NAS SP @64 procs
• Processor set constraints on communication sets
—12% fewer Icache misses, 7% faster: NAS SP @ 64 procs
PACT 2002 Best Student Paper (with Daniel Chavarria-Miranda)
Experimental Evaluation
• NAS SP & BT benchmarks from NASA Ames
—use ADI to solve the Navier-Stokes equation in 3D
—forward & backward line sweeps on each dimension, each timestep
• Compare four variants
—MPI hand-coded multipartitioning (NASA)
—dHPF: multipartitioned
—dHPF: 2D partitioning, coarse-grain pipelining
—PGI’s pghpf: 1D partitioning with transpose
• Platform
—SGI Origin 2000: 128 250 MHz procs.
—SGI compilers + SGI MPI
Efficiency for NAS SP (102³ ‘B’ size)
[Chart: parallel efficiency vs. processor count; coarse-grain pipelining incurs > 2x the communication volume of multipartitioning, while the 1D transpose version has similar communication volume but more serialization.]
Efficiency for NAS BT (102³ ‘B’ size)
[Chart: parallel efficiency vs. processor count; coarse-grain pipelining incurs > 2x the communication volume of multipartitioning.]
Platform: SGI Origin 2000
NAS BT Parallelizations
Hand-coded 3D Multipartitioning
Compiler-generated 3D Multipartitioning
Execution Traces for NAS BT Class 'A' - 16 processors, SGI Origin 2000
Observations
• High performance requires perfection
—parallelism and load-balance
—communication frequency
—communication volume
—scalar performance
• Data-parallel compiler technology can
—ease the programming burden
—yield near hand-coded performance
Data-parallel Related Work
• Linear equations/set-based compilation
—[Pugh et al; Ancourt et al; Amarasinghe & Lam]
• Commercial HPF compilers
—xlHPF, pghpf, xHPF
• HPF/JA
—14 Teraflops on a code for the Earth Simulator
• Lots of research compiler efforts
—e.g. Polaris, CAPTOOLS
None support partially-replicated computation
None support multipartitioning
None achieve linear scaling on tightly-coupled codes
Outline
• Motivation
• Compiler technology for HPC
— Data-parallel programming systems
☛ Semi-automatic synthesis of performance models
• Challenges for the future
• Other work
Why Performance Modeling?
• Insight into applications
—barriers to scalability
—insight into optimizations
• Mapping applications to systems
—Grid resource selection & scheduling
—intelligent run-time adaptation
• Workload-based design of future systems
Modeling Challenges
• Performance depends on:
—architecture specific factors
—application characteristics
—input data parameters
• Difficult to model execution time directly
• Collecting data at scale is expensive
Approach
Separate contribution of application characteristics
• Measure the application-specific factors
—static analysis
—dynamic analysis
• Construct scalable models
• Explore interactions with hardware
Use binary analysis and instrumentation for language and programming model independence
[Marin & Mellor-Crummey SIGMETRICS 04]
Toolkit Design Overview
[Diagram:
• Static analysis: a Binary Analyzer extracts the control flow graph, loop nesting structure, and BB instruction mix from the object code.
• Dynamic analysis: a Binary Instrumenter produces instrumented code; executing it yields BB counts, communication volume & frequency, and memory reuse distance.
• Post processing: a Post Processing Tool builds an architecture-neutral model, which a Scheduler combines with an Architecture Description to produce a performance prediction for the target architecture.]
Building Scalable Models
• Collect data from multiple runs
—n+1 runs to compute a model of degree n
• Approximation function: F(X) = cn*Bn(X) + cn-1*Bn-1(X) + … + c0*B0(X)
—the Bi are a set of basis functions
• Include constraints
• Goal: determine the coefficients ci
—use quadratic programming
Execution Frequency Modeling Example

Collected data (X = problem size):
  X:      23      …  18      15      9      5      2
  Count:  289200  …  183160  131104  53020  20244  5784

[Chart: execution frequency vs. problem size (0–45), showing the collected data and the fitted models.]
• Model degree 0: Y = 41416, Err = 131%
• Model degree 1: Y = 16776*X - 42366, Err = 60.4%
• Model degree 2: Y = 482*X² + 1446*X + 964, Err = 0%
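The fit can be sketched from the six (X, Count) pairs legible in the slide's table. This is an illustrative sketch only: the toolkit fits general basis functions with quadratic programming, while this sketch solves the exact-interpolation case using n + 1 = 3 runs for a degree-2 model, as the earlier slide prescribes.

```python
# Sketch: fit a quadratic through three measured (problem size, count)
# points by solving the 3x3 Vandermonde system with Cramer's rule, then
# check it against the remaining runs from the slide's table.
runs = {2: 5784, 5: 20244, 9: 53020, 15: 131104, 18: 183160, 23: 289200}

def fit_quadratic(p1, p2, p3):
    # Solve c2*x^2 + c1*x + c0 = y for three (x, y) points.
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    det = lambda a, b, c, d, e, f, g, h, i: (a * (e * i - f * h)
                                             - b * (d * i - f * g)
                                             + c * (d * h - e * g))
    D  = det(x1*x1, x1, 1, x2*x2, x2, 1, x3*x3, x3, 1)
    c2 = det(y1, x1, 1, y2, x2, 1, y3, x3, 1) / D
    c1 = det(x1*x1, y1, 1, x2*x2, y2, 1, x3*x3, y3, 1) / D
    c0 = det(x1*x1, x1, y1, x2*x2, x2, y2, x3*x3, x3, y3) / D
    return c2, c1, c0

c2, c1, c0 = fit_quadratic((2, 5784), (9, 53020), (18, 183160))
# recovers Y = 482*X^2 + 1446*X + 964, which matches every other run
# in the table exactly (the slide's Err = 0% model)
```

Three runs determine the degree-2 coefficients; the three held-out runs then validate the model, which is exactly the "n + 1 runs for a model of degree n" methodology.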
Predict Schedule Latency for an Architecture
• Input:
—basic block and edge execution frequency
• Methodology:
—recover executed paths
—SPARC instructions ➔ generic RISC
—instantiate scheduler for architecture
—construct schedule for executed paths
—determine inefficiencies
Toolkit Design Overview
[Diagram repeated from the earlier Toolkit Design Overview slide.]
Memory Reuse Distance
• MRD: # unique data blocks referenced since target block last accessed

  reference   memory block   MRD
  I1          A              ∞
  I2          B              ∞
  I3          A              1
  I2          C              ∞
  I3          A              1
  I2          B              2
  I3          B              0

• I1: 1 cold miss
• I2: 2 cold misses, 1 @ distance 2
• I3: 1 @ distance 0, 2 @ distance 1
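The MRD definition above can be sketched directly. This is an illustrative stack-distance computation over the slide's access trace (blocks interleaved from instructions I1, I2, I3), not the toolkit's actual instrumentation code.

```python
# Sketch: memory reuse distance = number of unique blocks referenced
# since the target block was last accessed (None marks a cold miss).
def reuse_distances(trace):
    seen, dists = [], []
    for block in trace:
        if block in seen:
            i = len(seen) - 1 - seen[::-1].index(block)   # last access
            dists.append(len(set(seen[i + 1:])))          # unique blocks since
            del seen[i]
        else:
            dists.append(None)                            # cold (infinite) miss
        seen.append(block)
    return dists

trace = ['A', 'B', 'A', 'C', 'A', 'B', 'B']
# reuse_distances(trace) == [None, None, 1, None, 1, 2, 0],
# matching the slide's table row by row
```

In a fully associative LRU cache of k blocks, an access hits exactly when its reuse distance is below k, which is why these histograms predict cache behavior.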
Memory reuse distance
Modeling Memory Reuse Distance
• More complex than execution frequency
—cold misses
—histogram of reuse distances
– number of bins not constant
• Average reuse distance is misleading
—1 access with distance 10,000
—3 accesses with distance 0
—cache has 1024 blocks
—the average distance of 2500 would predict all four accesses miss, yet three of them hit
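The slide's numeric example can be checked directly. A small sketch under the usual fully-associative LRU assumption (an access hits iff its reuse distance is smaller than the cache size in blocks):

```python
# Sketch of why the average reuse distance misleads: 1 far access plus
# 3 near accesses against a 1024-block cache.
cache_blocks = 1024
distances = [10000, 0, 0, 0]

misses = sum(d >= cache_blocks for d in distances)   # only the far access misses
avg = sum(distances) / len(distances)                # 2500
avg_predicts_all_miss = avg >= cache_blocks          # True, yet 3 of 4 hit
```

Keeping the full histogram of distances, rather than a single average, is what lets the model distinguish the one miss from the three hits.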
Modeling Memory Reuse Distance
[Chart: normalized frequency vs. reuse distance; 50% of accesses at distance 2, 30% at distance 13, 20% at distance 40.]
Modeling Memory Reuse Distance
Predict Number of Cache Misses
• Instantiate model for problem size 100
[Chart: predicted vs. measured miss counts, annotated 74% and 96%.]
Prediction: NAS BT 3.0 Mem Hier Utilization
NAS BT 3.0 Memory Hierarchy Utilization
[Chart: miss count / cell / time step vs. mesh size (0–200), comparing measured and predicted values for L1, L2 (×10), and TLB (×10).]
Prediction: NAS BT 3.0 Time on SGI Origin
[Chart: NAS BT 3.0 from SPARC to SGI Origin; cycles / cell / time step vs. mesh size (0–200). Series: measured time, scheduler latency, L1 miss penalty, L2 miss penalty, TLB miss penalty, and predicted time.]
Open Performance Modeling Issues
• Short term
—Better modeling of memory subsystem
– # outstanding loads to accurately predict memory latency
—Explore modeling of irregular applications
• Long term
—Model parallel applications
– present modeling applies between synchronization points
– combine with manually constructed parallel models
– semi-automatically recover parallel trends
—Understand dynamic parallelism
Modeling Related Work
• Reuse distance
—Cache utilization [Beyls & D’Hollander]
—Investigating optimizations [Ding et al.]
• Program instrumentation
—EEL, QPT [Ball, Larus, Schnarr]
• Scalable analytic models
—[Vernon et al.; Hoisie et al.]
• Cross-architecture models at scale
—[Snavely et al.; Cascaval et al.]
• Simulation (trace-based and execution-driven)
None yield semi-automatically derived scalable models
HPC Compiler Challenges for the Future
• Programming systems for large-scale machines
—Abstraction and greater expressiveness are needed
—Potential parallelism must be readily accessible
– implicit parallelism or explicit element-wise parallelism
—Locality and latency tolerance are both critical for performance
—Dynamic self-scheduled parallelism will be necessary
—Failure will occur and must be expected and handled
• Support for “self-tuning software” for complex architectures
• Compiler-based tools
—Debugging and performance analysis of large-scale software on dynamic systems is a major open problem
• Insight into hardware design
—Understanding impact of proposed designs on whole programs
Past Work
• Multiprocessor synchronization
—locks, synchronous barriers [ASPLOS89, TOCS91]
—reader-writer synchronization [PPOPP91]
—fuzzy barriers [IJPP94]
• Parallel debugging
—execution replay [JPDC90, TOC87]
—software instruction counter [ASPLOS89]
—detecting data races [WPDD93, SC91, SC90]
• Parallel programming environments
—Parascope [PIEEE 93], Dsystem [TPDT94]
• Parallel applications
—molecular dynamics [JCC92]
Ongoing Work
• Global address space parallel languages
—Co-array Fortran [LCPC03]
• Performance analysis
— [TJS02, LACSI01, ICS01, SIGMETRICS01]
• Improving node performance
—irregular mesh and particle codes [ICS99, IJPP00]
—sparse matrices [LACSI02, IJHPCA04]
—multigrid [ICS01]
—dense matrices [LACSI03]
• Grid computing [IJHPCA01]
• Library-based domain languages [JPDC01]