
Page 1: Taming High Performance Computing with Compiler Technology

John Mellor-Crummey

Department of Computer ScienceCenter for High Performance Software Research

Taming High PerformanceComputing with

Compiler Technology

www.cs.rice.edu/~johnmc/presentations/rice-4-04.pdf

Page 2: Taming High Performance Computing with Compiler Technology

2

High Performance Computing Applications

• Scientific inquiry ranging from elementary particles to cosmology

• Pollution modeling and remediation planning

• Storm forecasting and climate prediction

• Advanced vehicle design

• Computational chemistry and drug design

• Molecular nanotechnology

• Cryptology

• Nuclear weapons stewardship

Page 3: Taming High Performance Computing with Compiler Technology

3

High Performance Applications

[Diagram: the interplay of Algorithms, Data Structures, and Architectures]

• Effective parallelizations ↔ scalability

• Single-processor performance can differ by integer factors

Page 4: Taming High Performance Computing with Compiler Technology

4

Status of Highly-parallel Systems

“[Scalable, highly-parallel, microprocessor-based systems] remain in the research and experimental stage primarily because we lack adequate software technology, application-development tools, and, ultimately, well-developed applications.”

— “Information Technology Research: Investing in our Future,” PITAC Report to the President, 1999

Page 5: Taming High Performance Computing with Compiler Technology

5

Challenges for Highly Parallel Computing

• Effective algorithms for complex problems

• Programming models and compilers

• Application development tools

• Operating systems for large-scale machines

• Better designs for high-performance architectures

Page 6: Taming High Performance Computing with Compiler Technology

6

Current Research Themes

• Compiler support for data parallel programming
—Implicitly and explicitly parallel global address space languages

• Technology for auto-tuning software
—Automatically tailor code to a microprocessor architecture

• Performance analysis tools
—Understanding application behavior on current systems

• Performance modeling
—How will applications perform at different scales and on future systems

• Compiler technology for scientific scripting languages
—R language for statistical programming

Page 7: Taming High Performance Computing with Compiler Technology

7

Outline

• Motivation

• Compiler technology for HPC

☛Compiling data-parallel languages

— Semi-automatic synthesis of performance models

• Challenges for the future

• Other work

Page 8: Taming High Performance Computing with Compiler Technology

8

Compiling data-parallel languages

• Introduction

— Data parallelism

— Compiling HPF-like languages

• Rice dHPF compiler

— Data partitioning research

— Analysis and code generation

• Experimental results

Page 9: Taming High Performance Computing with Compiler Technology

9

Data Parallelism

• Apply the same operation to many data elements

—need not be synchronous

—need not be completely uniform

• Applicable to many problems in science and engineering

Page 10: Taming High Performance Computing with Compiler Technology

10

Data Parallel Programming Alternatives

• Hand-coded parallelizations using library-based models
—complete applicability
—difficult to design and implement
—all responsibility for tuning falls to the developer

• Application frameworks
—easy to use
—limited applicability

• Single-threaded data-parallel languages
—much more flexible than application frameworks
—much simpler to use than hand-coded parallelizations
—compilers significantly determine performance
– offload details of tuning from the developer
– compilers are enormously complex
– out of luck if the compiler doesn’t deliver performance

Page 11: Taming High Performance Computing with Compiler Technology

11

Data Parallel Compilation

[Diagram: HPF program compilation — a High Performance Fortran program (Fortran program + data partitioning) is compiled for a parallel machine; the compiler partitions computation, inserts communication, and manages storage, yielding the same answers as the sequential program]

Partitioning of data drives partitioning of computation, communication, and synchronization.

Page 12: Taming High Performance Computing with Compiler Technology

12

Example HPF Program

CHPF$ processors P(3,3)
CHPF$ distribute A(block, block) onto P
CHPF$ distribute B(block, block) onto P
      DO i = 2, n - 1
        DO j = 2, n - 1
          A(i,j) = .25 * (B(i-1,j) + B(i+1,j) + B(i,j-1) + B(i,j+1))

[Figure: the (BLOCK,BLOCK) distribution of the data for A and B onto the processor grid P(0,0) … P(2,2)]

Page 13: Taming High Performance Computing with Compiler Technology

13

Compiling HPF-like Languages

• Partition data

• Select mapping of computation to processors

• Analyze communication requirements

• Partition computation by reducing loop bounds

• Insert communication
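To make the owner-computes steps concrete, here is a minimal Python sketch of loop-bound reduction for the Page 12 stencil under a (BLOCK, BLOCK) distribution; the helper names and the ceiling-block layout are illustrative assumptions, not dHPF's generated code.

# Minimal sketch: owner-computes loop-bound reduction for A(BLOCK, BLOCK)
# distributed over a px x py processor grid (illustrative; not dHPF output).

def block_range(n, nprocs, p):
    """1-based indices owned by processor p under a BLOCK distribution."""
    size = (n + nprocs - 1) // nprocs              # ceiling block size
    return p * size + 1, min((p + 1) * size, n)

def local_bounds(n, px, py, x, y):
    """Reduced bounds of 'DO i = 2, n-1 / DO j = 2, n-1' on processor (x, y):
    the global iteration space intersected with the block of A it owns."""
    ilo, ihi = block_range(n, px, x)
    jlo, jhi = block_range(n, py, y)
    return (max(ilo, 2), min(ihi, n - 1)), (max(jlo, 2), min(jhi, n - 1))

if __name__ == "__main__":
    n, px, py = 90, 3, 3
    for x in range(px):
        for y in range(py):
            (i1, i2), (j1, j2) = local_bounds(n, px, py, x, y)
            # The B(i-1,j), B(i+1,j), B(i,j-1), B(i,j+1) stencil also needs a
            # one-element shadow (ghost) region around each block; this is
            # where communication gets inserted.
            print(f"P({x},{y}): i = {i1}..{i2}, j = {j1}..{j2}")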

Page 14: Taming High Performance Computing with Compiler Technology

14

The Devil is in the Details …

• Good data and computation partitionings are a must

—without good partitionings, parallelism suffers!

• Excess communication undermines scalability

—both frequency and volume must be right!

• Single processor efficiency is critical

—must use caches effectively

—node code must be amenable to optimization

Goal: compiler and runtime techniques that enable simple and natural programming, yet deliver the performance of hand-coded parallelizations

Page 15: Taming High Performance Computing with Compiler Technology

15

Rice dHPF Compiler

Achievements
• parallelize sequential codes with minimal rewriting
• near hand-coded performance for tightly coupled codes

Innovations
• Sophisticated data partitionings
• Abstract set-based framework for communication analysis and code generation
• Sophisticated computation partitionings
—partial replication to reduce communication
• Comprehensive optimizations

Page 16: Taming High Performance Computing with Compiler Technology

16

Data Partitioning

• Good parallel performance requires suitable partitioning

• Tightly-coupled computations are problematic

• Line-sweep computations: e.g., ADI integration

      do j = 1, n
        do i = 2, n
          a(i,j) = … a(i-1,j)

Recurrences make parallelization difficult with BLOCK partitionings.

Page 17: Taming High Performance Computing with Compiler Technology

17

Coarse-Grain Pipelining

[Figure: a 2-D array block-partitioned across Processors 0-3, with the sweep proceeding as a wavefront]

• With block partitioning, compute along partitioned dimensions
• Partial serialization induces wavefront parallelism (see the schedule sketch below)
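The wavefront behavior can be seen in a small schedule sketch (illustrative Python, not compiler output): block the non-partitioned j dimension into coarse-grain chunks, and a tile (processor p, j-block b) becomes ready at step p + b.

# Sketch of the coarse-grain pipelining schedule: the i dimension is block-
# partitioned over P_PROCS processors; the sweep's dependence on a(i-1, j)
# makes processor p wait for processor p-1, but only for the same block of j
# columns, so different j blocks proceed in a pipeline (wavefront).

P_PROCS = 4      # processors along the partitioned i dimension
J_BLOCKS = 6     # coarse-grain blocks of j columns

for step in range(P_PROCS + J_BLOCKS - 1):
    ready = [(p, b) for p in range(P_PROCS) for b in range(J_BLOCKS) if p + b == step]
    print(f"step {step:2d}: " + ", ".join(f"P{p}:jblk{b}" for p, b in ready))
# After a startup delay of P_PROCS - 1 steps, all processors are busy.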


Page 19: Taming High Performance Computing with Compiler Technology

19

Parallelizing Line Sweeps

[Figure: hand-coded multipartitioning vs. compiler-generated coarse-grain pipelining]

Page 20: Taming High Performance Computing with Compiler Technology

20

Diagonal Multipartitioning

[Figure: tiles of a cut domain assigned to Processors 0-3 along diagonals]

• Each processor owns 1 tile between each pair of cuts along each distributed dimension
• Enables full parallelism for a sweep along any partitioned dimension (see the sketch below)
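As an illustration of the ownership rule, here is a sketch of the classic 2-D diagonal assignment (illustrative Python, not dHPF's internal representation): with p processors, cut the domain into p × p tiles and give tile (i, j) to processor (i + j) mod p.

# Diagonal multipartitioning sketch in 2D: every tile-row and tile-column
# contains exactly one tile per processor, so a line sweep along either
# dimension keeps all processors busy at every stage.

p = 4
owner = [[(i + j) % p for j in range(p)] for i in range(p)]

for row in owner:
    print(" ".join(str(o) for o in row))

for k in range(p):
    assert sorted(owner[k]) == list(range(p))                       # tile-row k
    assert sorted(owner[i][k] for i in range(p)) == list(range(p))  # tile-column k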


Page 22: Taming High Performance Computing with Compiler Technology

22

Generalized Multipartitioning

Given an n-dimensional data domain and p processors, select which λ dimensions to partition (2 ≤ λ ≤ n), and how many cuts in each.

• Partitioning constraints
—# tiles in each (λ-1)-dimensional hyperplane is a multiple of p
—no more cuts than necessary

• Mapping constraints
—load balance: in a hyperplane, each proc has same # tiles
—neighbor: in any particular direction, the neighbor of a given processor is the same

• Objective function: minimize communication volume
—pick the configuration of cuts to minimize total cross section

IPDPS 2002 Best paper in Algorithms; JPDC 2003

Page 23: Taming High Performance Computing with Compiler Technology

23

Choosing the Best Partitioning

• Enumerate all elementary partitionings

—candidates depend on factorization of p

• Evaluate their communication cost

• Select the minimum cost partitioning

Complexity: bounded by (the number of choices for picking a pair of dimensions to partition with a number of cuts divisible by a particular prime factor) × (the number of possible unique prime factors of p)
—very fast in practice
—worst case: p is a product of unique prime factors
(a simplified sketch of the search follows below)
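A drastically simplified sketch of this search (illustrative Python): restrict to λ = 2, enumerate factor pairs of p as cut counts for two of the three dimensions, score each candidate by the total cross-section of its cut planes, and keep the cheapest. The published algorithm is more general and also enforces the load-balance and neighbor constraints of generalized multipartitioning.

# Simplified partitioning search: enumerate candidates from the factorization
# of p, evaluate their communication cost (total cut cross-section), and pick
# the minimum. Illustrative only.

from itertools import permutations

def factor_pairs(p):
    return [(q, p // q) for q in range(1, p + 1) if p % q == 0]

def cross_section(dims, cut_dims, cuts):
    """Total area of the cut planes: (q - 1) planes across dimension k, each
    as large as the product of the other two extents."""
    total = 0
    for k, q in zip(cut_dims, cuts):
        others = [dims[d] for d in range(3) if d != k]
        total += (q - 1) * others[0] * others[1]
    return total

def best_partitioning(dims, p):
    best = None
    for da, db in permutations(range(3), 2):       # which two dims to cut
        for q1, q2 in factor_pairs(p):             # how many pieces in each
            cost = cross_section(dims, (da, db), (q1, q2))
            if best is None or cost < best[0]:
                best = (cost, (da, q1), (db, q2))
    return best

print(best_partitioning((512, 256, 128), 16))      # (cost, (dim, pieces), (dim, pieces))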

Page 24: Taming High Performance Computing with Compiler Technology

24

Mapping Tiles with Modular Mappings

[Figure: constructing the tile-to-processor mapping from a basic tile shape via modular shifts, tiling the domain with an integral # of shapes]

Page 25: Taming High Performance Computing with Compiler Technology

25

Formal Compilation Framework

3 types of sets: Data, Iterations, Processors

3 types of mappings:
—Layout: processors ↔ data
—Reference: iterations ↔ data
—CompPart: processors ↔ iterations

• Representation

—integer tuples with Presburger arithmetic for constraints

• Analysis: Use set equations to compute set(s) of interest

— iterations allocated to a processor

—communication sets

• Code generation: Synthesize loops from set(s), e.g.

—parallel (SPMD) loop nests

—message packing and unpacking

[Adve & Mellor-Crummey, PLDI98]

Page 26: Taming High Performance Computing with Compiler Technology

26

Why Symbolic Sets?

      processors P(3,3)
      distribute A(block, block) onto P
      distribute B(block, block) onto P
      DO i = 2, n - 1
        DO j = 2, n - 1
          A(i, j) = .25 * ( B(i-1, j) + B(i+1, j) + B(i, j-1) + B(i, j+1) )

[Figure: the data / loop partitioning of A and B (20 × 30 tiles) over the 3 × 3 processor grid P(0,0) … P(2,2), highlighting for a processor P(x,y) its local section (and the iterations it executes), the non-local data it accesses, and the iterations that access non-local data]

The sets involved are parameterized symbolically by the processor coordinates (x, y), e.g.
  { [i,j] : 20x+2 ≤ i ≤ 20x+19 & 30y+2 ≤ j ≤ 30y+29 }

Page 27: Taming High Performance Computing with Compiler Technology

27

Integer-Set Framework: Example

      real A(100)
      distribute A(BLOCK) on P(4)
      do i = 1, N
        ... = A(i-1) + A(i-2) + ...   ! ON_HOME A(i-1)
      enddo

symbolic N

Layout := { [pid] -> [i] : 25*pid + 1 ≤ i ≤ 25*pid + 25 }
Loop := { [i] : 1 ≤ i ≤ N }
CPSubscript := { [i] -> [i-1] }
RefSubscript := { [i] -> [i-2] }

CompPart := (Layout ∘ CPSubscript⁻¹) ∩ Loop
DataAccessed := CompPart ∘ RefSubscript
NonLocalDataAccessed := DataAccessed - Layout
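The same computation can be spelled out with finite enumerated sets (a Python sketch for N = 100; dHPF keeps these sets symbolic in a Presburger-arithmetic framework, so this is only an illustration of the algebra):

# Finite stand-in for the symbolic sets above: A(100) distributed BLOCK over
# P(4), computation partitioned ON_HOME A(i-1). Illustrative enumeration only.

N, NPROCS = 100, 4

Layout = {(pid, i) for pid in range(NPROCS)
                   for i in range(25 * pid + 1, 25 * pid + 26)}
Loop = set(range(1, N + 1))

def cp_subscript(i):  return i - 1    # ON_HOME reference A(i-1)
def ref_subscript(i): return i - 2    # the other reference, A(i-2)

# CompPart = (Layout o CPSubscript^-1) ∩ Loop : iterations run on each pid
CompPart = {(pid, i) for (pid, a) in Layout for i in Loop if cp_subscript(i) == a}

# DataAccessed = CompPart o RefSubscript : data touched by those iterations
DataAccessed = {(pid, ref_subscript(i)) for (pid, i) in CompPart}

# Non-local data = DataAccessed - Layout : elements that must be communicated
print(sorted(DataAccessed - Layout))
# -> [(0, 0), (1, 25), (2, 50), (3, 75)]
#    e.g. pid 1 must fetch A(25) from pid 0; the (0, 0) entry is the
#    out-of-bounds touch A(0) at iteration i = 2 in this toy example.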

Page 28: Taming High Performance Computing with Compiler Technology

28

Optimizations using Integer Sets

• Partially replicate computation to reduce communication

—66% lower message volume, 38% faster: NAS BT @ 64 procs

• Coalesce communication sets for multiple references

—41% lower message volume, 35% faster: NAS SP @ 64 procs

• Split loops into “local-only” and “off-processor” loops

—10% fewer Dcache misses, 9% faster: NAS SP @ 64 procs

• Processor set constraints on communication sets

—12% fewer Icache misses, 7% faster: NAS SP @ 64 procs

PACT 2002 Best student paper (with Daniel Chavarria-Miranda)

Page 29: Taming High Performance Computing with Compiler Technology

29

Experimental Evaluation

• NAS SP & BT benchmarks from NASA Ames
—use ADI to solve the Navier-Stokes equation in 3D
—forward & backward line sweeps on each dimension, each timestep

• Compare four variants
—MPI hand-coded multipartitioning (NASA)
—dHPF: multipartitioned
—dHPF: 2D partitioning, coarse-grain pipelining
—PGI’s pghpf: 1D partitioning with transpose

• Platform
—SGI Origin 2000: 128 250 MHz procs.
—SGI compilers + SGI MPI

Page 30: Taming High Performance Computing with Compiler Technology

30

Efficiency for NAS SP (102³ ‘B’ size)

> 2x multipartitioning comm. volume

similar comm. volume, more serialization

Page 31: Taming High Performance Computing with Compiler Technology

31

Efficiency for NAS BT (102³ ‘B’ size)

> 2x multipartitioning comm. volume

Platform: SGI Origin 2000

Page 32: Taming High Performance Computing with Compiler Technology

32

NAS BT Parallelizations

[Execution traces for NAS BT Class 'A' - 16 processors, SGI Origin 2000: hand-coded 3D multipartitioning vs. compiler-generated 3D multipartitioning]

Page 33: Taming High Performance Computing with Compiler Technology

33

Observations

• High performance requires perfection

—parallelism and load-balance

—communication frequency

—communication volume

—scalar performance

• Data-parallel compiler technology can

—ease the programming burden

—yield near hand-coded performance

Page 34: Taming High Performance Computing with Compiler Technology

34

Data-parallel Related Work

• Linear equations/set-based compilation

—[Pugh et al; Ancourt et al; Amarasinghe & Lam]

• Commercial HPF compilers

—xlHPF, pghpf, xHPF

• HPF/JA

—14 Teraflops on a code for the Earth Simulator

• Lots of research compiler efforts

—e.g. Polaris, CAPTOOLS

None support partially-replicated computation
None support multipartitioning
None achieve linear scaling on tightly-coupled codes

Page 35: Taming High Performance Computing with Compiler Technology

35

Outline

• Motivation

• Compiler technology for HPC

— Data-parallel programming systems

☛Semi-automatic synthesis of performance models

• Challenges for the future

• Other work

Page 36: Taming High Performance Computing with Compiler Technology

36

Why Performance Modeling?

• Insight into applications

—barriers to scalability

—insight into optimizations

• Mapping applications to systems

—Grid resource selection & scheduling

—intelligent run-time adaptation

• Workload-based design of future systems

Page 37: Taming High Performance Computing with Compiler Technology

37

Modeling Challenges

• Performance depends on:

—architecture specific factors

—application characteristics

—input data parameters

• Difficult to model execution time directly

• Collecting data at scale is expensive

Page 38: Taming High Performance Computing with Compiler Technology

38

Approach

Separate contribution of application characteristics

• Measure the application-specific factors

—static analysis

—dynamic analysis

• Construct scalable models

• Explore interactions with hardware

Use binary analysis and instrumentation for language and programming model independence

[Marin & Mellor-Crummey SIGMETRICS 04]

Page 39: Taming High Performance Computing with Compiler Technology

39

Toolkit Design Overview

[Toolkit diagram:
—Static analysis: a binary analyzer extracts the control flow graph, loop nesting structure, and basic-block (BB) instruction mix from the object code
—Dynamic analysis: a binary instrumenter produces instrumented code; executing it collects BB counts, communication volume & frequency, and memory reuse distance
—Post processing: a post-processing tool combines these into an architecture-neutral model; a scheduler, driven by an architecture description, produces a performance prediction for the target architecture]

Page 40: Taming High Performance Computing with Compiler Technology

40

Building Scalable Models

• Collect data from multiple runs

—n+1 runs to compute a model of degree n

• Approximation function:

F(X) = c_n*B_n(X) + c_(n-1)*B_(n-1)(X) + … + c_0*B_0(X)

• A set of basis functions

• Include constraints

• Goal: determine coefficients

Use quadratic programming
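A simplified sketch of the coefficient-determination step (Python/numpy): the toolkit poses this as a constrained quadratic program, which this sketch replaces with plain least squares over a monomial basis, using synthetic data generated from the degree-2 model shown on the following slides.

# Fit F(X) = sum_k c_k * B_k(X) with B_k(X) = X**k to (problem size, count)
# samples from multiple runs, one fit per candidate degree. Synthetic data;
# plain least squares stands in for the constrained quadratic program.

import numpy as np

X = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])   # problem sizes (synthetic)
Y = 482 * X**2 + 1446 * X + 964                      # counts from the degree-2 model

def fit(degree):
    A = np.vander(X, degree + 1, increasing=True)    # columns are B_k(X) = X**k
    coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
    rel_err = np.linalg.norm(A @ coeffs - Y) / np.linalg.norm(Y)
    return coeffs, rel_err

for d in range(3):                                    # n+1 runs suffice for degree n
    c, err = fit(d)
    print(f"degree {d}: coefficients {np.round(c, 1)}, relative error {err:.3f}")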

Page 41: Taming High Performance Computing with Compiler Technology

41

Execution Frequency Modeling Example

[Plot “Execution Frequency Model”: execution frequency vs. problem size, showing the collected data points, with a table of the measured (problem size X, Count) samples]

Page 42: Taming High Performance Computing with Compiler Technology

42

Execution Frequency Modeling Example

[Plot “Execution Frequency Model”: collected data with the degree-0 model overlaid]

Model degree 0: Y = 41416, Err = 131%

Page 43: Taming High Performance Computing with Compiler Technology

43

Execution Frequency Modeling Example

[Plot “Execution Frequency Model”: collected data with the degree-0 and degree-1 models overlaid]

Model degree 0: Y = 41416, Err = 131%
Model degree 1: Y = 16776*X - 42366, Err = 60.4%

Page 44: Taming High Performance Computing with Compiler Technology

44

Execution Frequency Modeling Example

[Plot “Execution Frequency Model”: collected data with the degree-0, 1, and 2 models overlaid]

Model degree 0: Y = 41416, Err = 131%
Model degree 1: Y = 16776*X - 42366, Err = 60.4%
Model degree 2: Y = 482*X^2 + 1446*X + 964, Err = 0%

Page 45: Taming High Performance Computing with Compiler Technology

45

Predict Schedule Latency for an Architecture

• Input:

—basic block and edge execution frequency

• Methodology:

—recover executed paths

—SPARC instructions ➔ generic RISC

—instantiate scheduler for architecture

—construct schedule for executed paths

—determine inefficiencies

Page 46: Taming High Performance Computing with Compiler Technology

46

Toolkit Design Overview

[Toolkit design overview diagram, repeated from Page 39: static analysis, dynamic analysis, and post processing feeding the scheduler and architecture description to produce a performance prediction for the target architecture]

Page 47: Taming High Performance Computing with Compiler Technology

47

Memory Reuse Distance

• MRD: # unique data blocks referenced since the target block was last accessed

Example trace (a small sketch of this computation appears below):

  reference   memory block   MRD
  I1          A              ∞
  I2          B              ∞
  I3          A              1
  I2          C              ∞
  I3          A              1
  I2          B              2
  I3          B              0

• I1: 1 cold miss
• I2: 2 cold misses, 1 @ distance 2
• I3: 1 @ distance 0, 2 @ distance 1
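A small Python sketch that reproduces the MRD column above; the measurement tool tracks this on instrumented binaries with an efficient data structure, whereas this naive version simply replays the example trace.

# Reuse distance = number of distinct blocks referenced since the target
# block was last accessed; cold misses get distance infinity.

import math

def reuse_distances(blocks):
    stack, out = [], []                # most recently used block kept at the end
    for b in blocks:
        if b in stack:
            pos = stack.index(b)
            out.append(len(stack) - pos - 1)   # distinct blocks touched since
            stack.pop(pos)
        else:
            out.append(math.inf)               # cold miss
        stack.append(b)
    return out

trace = [("I1", "A"), ("I2", "B"), ("I3", "A"), ("I2", "C"),
         ("I3", "A"), ("I2", "B"), ("I3", "B")]
dists = reuse_distances([blk for _, blk in trace])
for (ref, blk), d in zip(trace, dists):
    print(f"{ref} touches {blk}: MRD = {d}")   # matches the table above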

Page 48: Taming High Performance Computing with Compiler Technology

48

Memory reuse distance

Page 49: Taming High Performance Computing with Compiler Technology

49

Modeling Memory Reuse Distance

• More complex than execution frequency
—cold misses
—histogram of reuse distances
– number of bins not constant

• Average reuse distance is misleading
—1 access with distance 10,000
—3 accesses with distance 0
—cache has 1024 blocks
—the average distance is 2,500 blocks (larger than the cache), yet 3 of the 4 accesses hit

Page 50: Taming High Performance Computing with Compiler Technology

50

Modeling Memory Reuse Distance

[Histogram: normalized frequency vs. reuse distance, e.g. bars of 50%, 30%, and 20% at reuse distances 2, 13, and 40]

Page 51: Taming High Performance Computing with Compiler Technology

51

Modeling Memory Reuse Distance

Page 52: Taming High Performance Computing with Compiler Technology

52

Predict Number of Cache Misses

• Instantiate model for problem size 100

[Plots comparing predicted and measured cache behavior for problem size 100; annotations: 74%, 96%]
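The translation from a reuse-distance histogram to a miss prediction can be sketched as follows (Python; this assumes a fully associative LRU cache with C blocks, under which a reference misses exactly when its reuse distance is at least C). The Page 49 example shows why the 2,500-block average is misleading: only 1 of the 4 accesses misses a 1024-block cache.

# Predicted misses from a reuse-distance histogram for a fully associative
# LRU cache with `cache_blocks` blocks (cold misses have infinite distance).

import math

def predicted_misses(histogram, cache_blocks):
    """histogram maps reuse distance -> number of references at that distance."""
    return sum(n for dist, n in histogram.items() if dist >= cache_blocks)

# Page 49 example: 1 access at distance 10,000; 3 at distance 0; no cold misses.
hist = {10_000: 1, 0: 3, math.inf: 0}
print(predicted_misses(hist, cache_blocks=1024))    # -> 1 miss, not 4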

Page 53: Taming High Performance Computing with Compiler Technology

53

Prediction: NAS BT 3.0 Mem Hier Utilization

[Plot “NAS BT 3.0 Memory Hierarchy Utilization”: miss count / cell / time step (0 to 300) vs. mesh size (0 to 200), with curves for L1 measured, L1 predicted, L2 measured (×10), L2 predicted (×10), TLB measured (×10), TLB predicted (×10)]

Page 54: Taming High Performance Computing with Compiler Technology

54

Prediction: NAS BT 3.0 Time on SGI Origin

[Plot “NAS BT 3.0 from SPARC to SGI Origin”: cycles / cell / time step vs. mesh size (0 to 200), with curves for measured time, scheduler latency, L1 miss penalty, L2 miss penalty, TLB miss penalty, and predicted time]

Page 55: Taming High Performance Computing with Compiler Technology

55

Open Performance Modeling Issues

• Short term

—Better modeling of memory subsystem
– # outstanding loads to accurately predict memory latency

—Explore modeling of irregular applications

• Long term

—Model parallel applications
– Present modeling applies between synchronization points

– Combine with manually constructed parallel models

– Semi-automatically recover parallel trends

—Understand dynamic parallelism

Page 56: Taming High Performance Computing with Compiler Technology

56

Modeling Related Work

• Reuse distance
—Cache utilization [Beyls & D’Hollander]
—Investigating optimizations [Ding et al.]

• Program instrumentation
—EEL, QPT [Ball, Larus, Schnarr]

• Scalable analytic models
—[Vernon et al.; Hoisie et al.]

• Cross-architecture models at scale
—[Snavely et al.; Cascaval et al.]

• Simulation (trace-based and execution-driven)

None yield semi-automatically derived scalable models

Page 57: Taming High Performance Computing with Compiler Technology

57

HPC Compiler Challenges for the Future

• Programming systems for large-scale machines
—Abstraction and greater expressiveness are needed
—Potential parallelism must be readily accessible
– implicit parallelism or explicit element-wise parallelism

—Locality and latency tolerance are both critical for performance

—Dynamic self-scheduled parallelism will be necessary

—Failure will occur and must be expected and handled

• Support for “self-tuning software” for complex architectures

• Compiler-based tools
—Debugging and performance analysis of large-scale software on dynamic systems is a major open problem

• Insight into hardware design
—Understanding impact of proposed designs on whole programs

Page 58: Taming High Performance Computing with Compiler Technology

58

Past Work

• Multiprocessor synchronization

—locks, synchronous barriers [ASPLOS89, TOCS91]

—reader-writer synchronization [PPOPP91]

—fuzzy barriers [IJPP94]

• Parallel debugging

—execution replay [JPDC90, TOC87]

—software instruction counter [ASPLOS89]

—detecting data races [WPDD93, SC91, SC90]

• Parallel programming environments

—Parascope [PIEEE 93], Dsystem [TPDT94]

• Parallel applications

—molecular dynamics [JCC92]

Page 59: Taming High Performance Computing with Compiler Technology

59

Ongoing Work

• Global address space parallel languages

—Co-array Fortran [LCPC03]

• Performance analysis

— [TJS02, LACSI01, ICS01, SIGMETRICS01]

• Improving node performance

—irregular mesh and particle codes [ICS99, IJPP00]

—sparse matrices [LACSI02, IJHPCA04]

—multigrid [ICS01]

—dense matrices [LACSI03]

• Grid computing [IJHPCA01]

• Library-based domain languages [JPDC01]