
Page 1: Taming High Performance Computing with Compiler Technology

John Mellor-Crummey

Department of Computer ScienceCenter for High Performance Software Research

Taming High PerformanceComputing with

Compiler Technology

www.cs.rice.edu/~johnmc/presentations/rice-4-04.pdf

Page 2: Taming High Performance Computing with Compiler Technology

2

High Performance Computing Applications

• Scientific inquiry ranging from elementary particles to cosmology

• Pollution modeling and remediation planning

• Storm forecasting and climate prediction

• Advanced vehicle design

• Computational chemistry and drug design

• Molecular nanotechnology

• Cryptology

• Nuclear weapons stewardship

Page 3: Taming High Performance Computing with Compiler Technology

3

High Performance Applications

[Diagram: the interplay of Algorithms, Data Structures, and Architectures]

• Effective parallelizations ↔ scalability

• Single-processor performance can differ by integer factors

Page 4: Taming High Performance Computing with Compiler Technology

4

Status of Highly-parallel Systems

“[Scalable, highly-parallel, microprocessor-based systems] remain in the research and experimental stage primarily because we lack adequate software technology, application-development tools, and, ultimately, well-developed applications.”

— “Information Technology Research: Investing in our Future,” PITAC Report to the President, 1999

Page 5: Taming High Performance Computing with Compiler Technology

5

Challenges for Highly Parallel Computing

• Effective algorithms for complex problems

• Programming models and compilers

• Application development tools

• Operating systems for large-scale machines

• Better designs for high-performance architectures

Page 6: Taming High Performance Computing with Compiler Technology

6

Current Research Themes

• Compiler support for data parallel programming
—Implicitly and explicitly parallel global address space languages

• Technology for auto-tuning software
—Automatically tailor code to a microprocessor architecture

• Performance analysis tools
—Understanding application behavior on current systems

• Performance modeling
—How will applications perform at different scales and on future systems

• Compiler technology for scientific scripting languages
—R language for statistical programming

Page 7: Taming High Performance Computing with Compiler Technology

7

Outline

• Motivation

• Compiler technology for HPC

☛Compiling data-parallel languages

— Semi-automatic synthesis of performance models

• Challenges for the future

• Other work

Page 8: Taming High Performance Computing with Compiler Technology

8

Compiling data-parallel languages

• Introduction

— Data parallelism

— Compiling HPF-like languages

• Rice dHPF compiler

— Data partitioning research

— Analysis and code generation

• Experimental results

Page 9: Taming High Performance Computing with Compiler Technology

9

Data Parallelism

• Apply the same operation to many data elements

—need not be synchronous

—need not be completely uniform

• Applicable to many problems in science and engineering

Page 10: Taming High Performance Computing with Compiler Technology

10

Data Parallel Programming Alternatives

• Hand-coded parallelizations using library-based models
—complete applicability
—difficult to design and implement
—all responsibility for tuning falls to the developer

• Application frameworks
—easy to use
—limited applicability

• Single-threaded data-parallel languages
—much more flexible than application frameworks
—much simpler to use than hand-coded parallelizations
—compilers significantly determine performance
– offload details of tuning from the developer
– compilers are enormously complex
– out of luck if the compiler doesn’t deliver performance

Page 11: Taming High Performance Computing with Compiler Technology

11

Data Parallel Compilation

[Diagram: HPF program compilation — a High Performance Fortran program (Fortran program + data partitioning) is compiled for a parallel machine; the compiler partitions computation, inserts communication, and manages storage, yielding the same answers as the sequential program]

Partitioning of data drives partitioning of computation, communication, and synchronization.

Page 12: Taming High Performance Computing with Compiler Technology

12

Example HPF Program

CHPF$ processors P(3,3)
CHPF$ distribute A(block, block) onto P
CHPF$ distribute B(block, block) onto P
      DO i = 2, n - 1
        DO j = 2, n - 1
          A(i,j) = .25 * (B(i-1,j) + B(i+1,j) + B(i,j-1) + B(i,j+1))

[Figure: the (BLOCK,BLOCK) distribution of the data for A and B onto the processor grid P(0,0) … P(2,2)]

Page 13: Taming High Performance Computing with Compiler Technology

13

Compiling HPF-like Languages

• Partition data

• Select mapping of computation to processors

• Analyze communication requirements

• Partition computation by reducing loop bounds

• Insert communication
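To make the owner-computes steps concrete, here is a minimal Python sketch of loop-bound reduction for the Page 12 stencil under a (BLOCK, BLOCK) distribution; the helper names and the ceiling-block layout are illustrative assumptions, not dHPF's generated code.

# Minimal sketch: owner-computes loop-bound reduction for A(BLOCK, BLOCK)
# distributed over a px x py processor grid (illustrative; not dHPF output).

def block_range(n, nprocs, p):
    """1-based indices owned by processor p under a BLOCK distribution."""
    size = (n + nprocs - 1) // nprocs              # ceiling block size
    return p * size + 1, min((p + 1) * size, n)

def local_bounds(n, px, py, x, y):
    """Reduced bounds of 'DO i = 2, n-1 / DO j = 2, n-1' on processor (x, y):
    the global iteration space intersected with the block of A it owns."""
    ilo, ihi = block_range(n, px, x)
    jlo, jhi = block_range(n, py, y)
    return (max(ilo, 2), min(ihi, n - 1)), (max(jlo, 2), min(jhi, n - 1))

if __name__ == "__main__":
    n, px, py = 90, 3, 3
    for x in range(px):
        for y in range(py):
            (i1, i2), (j1, j2) = local_bounds(n, px, py, x, y)
            # The B(i-1,j), B(i+1,j), B(i,j-1), B(i,j+1) stencil also needs a
            # one-element shadow (ghost) region around each block; this is
            # where communication gets inserted.
            print(f"P({x},{y}): i = {i1}..{i2}, j = {j1}..{j2}")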

Page 14: Taming High Performance Computing with Compiler Technology

14

The Devil is in the Details …

• Good data and computation partitionings are a must

—without good partitionings, parallelism suffers!

• Excess communication undermines scalability

—both frequency and volume must be right!

• Single processor efficiency is critical

—must use caches effectively

—node code must be amenable to optimization

Goal: compiler and runtime techniques that enable simple and natural programming, yet deliver the performance of hand-coded parallelizations

Page 15: Taming High Performance Computing with Compiler Technology

15

Rice dHPF Compiler

Achievements
• parallelize sequential codes with minimal rewriting
• near hand-coded performance for tightly coupled codes

Innovations
• Sophisticated data partitionings
• Abstract set-based framework for communication analysis and code generation
• Sophisticated computation partitionings
—partial replication to reduce communication
• Comprehensive optimizations

Page 16: Taming High Performance Computing with Compiler Technology

16

Data Partitioning

• Good parallel performance requires suitable partitioning

• Tightly-coupled computations are problematic

• Line-sweep computations: e.g., ADI integration

      do j = 1, n
        do i = 2, n
          a(i,j) = … a(i-1,j)

Recurrences make parallelization difficult with BLOCK partitionings.

Page 17: Taming High Performance Computing with Compiler Technology

17

Coarse-Grain Pipelining

[Figure: a 2-D array block-partitioned across Processors 0-3, with the sweep proceeding as a wavefront]

• With block partitioning, compute along partitioned dimensions
• Partial serialization induces wavefront parallelism (see the schedule sketch below)
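The wavefront behavior can be seen in a small schedule sketch (illustrative Python, not compiler output): block the non-partitioned j dimension into coarse-grain chunks, and a tile (processor p, j-block b) becomes ready at step p + b.

# Sketch of the coarse-grain pipelining schedule: the i dimension is block-
# partitioned over P_PROCS processors; the sweep's dependence on a(i-1, j)
# makes processor p wait for processor p-1, but only for the same block of j
# columns, so different j blocks proceed in a pipeline (wavefront).

P_PROCS = 4      # processors along the partitioned i dimension
J_BLOCKS = 6     # coarse-grain blocks of j columns

for step in range(P_PROCS + J_BLOCKS - 1):
    ready = [(p, b) for p in range(P_PROCS) for b in range(J_BLOCKS) if p + b == step]
    print(f"step {step:2d}: " + ", ".join(f"P{p}:jblk{b}" for p, b in ready))
# After a startup delay of P_PROCS - 1 steps, all processors are busy.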


Page 19: Taming High Performance Computing with Compiler Technology

19

Parallelizing Line Sweeps

[Figure: hand-coded multipartitioning vs. compiler-generated coarse-grain pipelining]

Page 20: Taming High Performance Computing with Compiler Technology

20

Diagonal Multipartitioning

[Figure: tiles of a cut domain assigned to Processors 0-3 along diagonals]

• Each processor owns 1 tile between each pair of cuts along each distributed dimension
• Enables full parallelism for a sweep along any partitioned dimension (see the sketch below)
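As an illustration of the ownership rule, here is a sketch of the classic 2-D diagonal assignment (illustrative Python, not dHPF's internal representation): with p processors, cut the domain into p × p tiles and give tile (i, j) to processor (i + j) mod p.

# Diagonal multipartitioning sketch in 2D: every tile-row and tile-column
# contains exactly one tile per processor, so a line sweep along either
# dimension keeps all processors busy at every stage.

p = 4
owner = [[(i + j) % p for j in range(p)] for i in range(p)]

for row in owner:
    print(" ".join(str(o) for o in row))

for k in range(p):
    assert sorted(owner[k]) == list(range(p))                       # tile-row k
    assert sorted(owner[i][k] for i in range(p)) == list(range(p))  # tile-column k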


Page 22: Taming High Performance Computing with Compiler Technology

22

Generalized Multipartitioning

Given an n-dimensional data domain and p processors, select which λ dimensions to partition (2 ≤ λ ≤ n), and how many cuts in each.

• Partitioning constraints
—# tiles in each (λ-1)-dimensional hyperplane is a multiple of p
—no more cuts than necessary

• Mapping constraints
—load balance: in a hyperplane, each proc has same # tiles
—neighbor: in any particular direction, the neighbor of a given processor is the same

• Objective function: minimize communication volume
—pick the configuration of cuts to minimize total cross section

IPDPS 2002 Best paper in Algorithms; JPDC 2003

Page 23: Taming High Performance Computing with Compiler Technology

23

Choosing the Best Partitioning

• Enumerate all elementary partitionings

—candidates depend on factorization of p

• Evaluate their communication cost

• Select the minimum cost partitioning

Complexity: bounded by (the number of choices for picking a pair of dimensions to partition with a number of cuts divisible by a particular prime factor) × (the number of possible unique prime factors of p)
—very fast in practice
—worst case: p is a product of unique prime factors
(a simplified sketch of the search follows below)
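A drastically simplified sketch of this search (illustrative Python): restrict to λ = 2, enumerate factor pairs of p as cut counts for two of the three dimensions, score each candidate by the total cross-section of its cut planes, and keep the cheapest. The published algorithm is more general and also enforces the load-balance and neighbor constraints of generalized multipartitioning.

# Simplified partitioning search: enumerate candidates from the factorization
# of p, evaluate their communication cost (total cut cross-section), and pick
# the minimum. Illustrative only.

from itertools import permutations

def factor_pairs(p):
    return [(q, p // q) for q in range(1, p + 1) if p % q == 0]

def cross_section(dims, cut_dims, cuts):
    """Total area of the cut planes: (q - 1) planes across dimension k, each
    as large as the product of the other two extents."""
    total = 0
    for k, q in zip(cut_dims, cuts):
        others = [dims[d] for d in range(3) if d != k]
        total += (q - 1) * others[0] * others[1]
    return total

def best_partitioning(dims, p):
    best = None
    for da, db in permutations(range(3), 2):       # which two dims to cut
        for q1, q2 in factor_pairs(p):             # how many pieces in each
            cost = cross_section(dims, (da, db), (q1, q2))
            if best is None or cost < best[0]:
                best = (cost, (da, q1), (db, q2))
    return best

print(best_partitioning((512, 256, 128), 16))      # (cost, (dim, pieces), (dim, pieces))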

Page 24: Taming High Performance Computing with Compiler Technology

24

Mapping Tiles with Modular Mappings

[Figure: constructing the tile-to-processor mapping from a basic tile shape via modular shifts, tiling the domain with an integral # of shapes]

Page 25: Taming High Performance Computing with Compiler Technology

25

Formal Compilation Framework

3 types of sets: Data, Iterations, Processors

3 types of mappings:
—Layout: processors ↔ data
—Reference: iterations ↔ data
—CompPart: processors ↔ iterations

• Representation

—integer tuples with Presburger arithmetic for constraints

• Analysis: Use set equations to compute set(s) of interest

— iterations allocated to a processor

—communication sets

• Code generation: Synthesize loops from set(s), e.g.

—parallel (SPMD) loop nests

—message packing and unpacking

[Adve & Mellor-Crummey, PLDI98]

Page 26: Taming High Performance Computing with Compiler Technology

26

Why Symbolic Sets?

      processors P(3,3)
      distribute A(block, block) onto P
      distribute B(block, block) onto P
      DO i = 2, n - 1
        DO j = 2, n - 1
          A(i, j) = .25 * ( B(i-1, j) + B(i+1, j) + B(i, j-1) + B(i, j+1) )

[Figure: the data / loop partitioning of A and B (20 × 30 tiles) over the 3 × 3 processor grid P(0,0) … P(2,2), highlighting for a processor P(x,y) its local section (and the iterations it executes), the non-local data it accesses, and the iterations that access non-local data]

The sets involved are parameterized symbolically by the processor coordinates (x, y), e.g.
  { [i,j] : 20x+2 ≤ i ≤ 20x+19 & 30y+2 ≤ j ≤ 30y+29 }

Page 27: Taming High Performance Computing with Compiler Technology

27

Integer-Set Framework: Example

      real A(100)
      distribute A(BLOCK) on P(4)
      do i = 1, N
        ... = A(i-1) + A(i-2) + ...   ! ON_HOME A(i-1)
      enddo

symbolic N

Layout := { [pid] -> [i] : 25*pid + 1 ≤ i ≤ 25*pid + 25 }
Loop := { [i] : 1 ≤ i ≤ N }
CPSubscript := { [i] -> [i-1] }
RefSubscript := { [i] -> [i-2] }

CompPart := (Layout ∘ CPSubscript⁻¹) ∩ Loop
DataAccessed := CompPart ∘ RefSubscript
NonLocalDataAccessed := DataAccessed - Layout
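The same computation can be spelled out with finite enumerated sets (a Python sketch for N = 100; dHPF keeps these sets symbolic in a Presburger-arithmetic framework, so this is only an illustration of the algebra):

# Finite stand-in for the symbolic sets above: A(100) distributed BLOCK over
# P(4), computation partitioned ON_HOME A(i-1). Illustrative enumeration only.

N, NPROCS = 100, 4

Layout = {(pid, i) for pid in range(NPROCS)
                   for i in range(25 * pid + 1, 25 * pid + 26)}
Loop = set(range(1, N + 1))

def cp_subscript(i):  return i - 1    # ON_HOME reference A(i-1)
def ref_subscript(i): return i - 2    # the other reference, A(i-2)

# CompPart = (Layout o CPSubscript^-1) ∩ Loop : iterations run on each pid
CompPart = {(pid, i) for (pid, a) in Layout for i in Loop if cp_subscript(i) == a}

# DataAccessed = CompPart o RefSubscript : data touched by those iterations
DataAccessed = {(pid, ref_subscript(i)) for (pid, i) in CompPart}

# Non-local data = DataAccessed - Layout : elements that must be communicated
print(sorted(DataAccessed - Layout))
# -> [(0, 0), (1, 25), (2, 50), (3, 75)]
#    e.g. pid 1 must fetch A(25) from pid 0; the (0, 0) entry is the
#    out-of-bounds touch A(0) at iteration i = 2 in this toy example.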

Page 28: Taming High Performance Computing with Compiler Technology

28

Optimizations using Integer Sets

• Partially replicate computation to reduce communication

—66% lower message volume, 38% faster: NAS BT @ 64 procs

• Coalesce communication sets for multiple references

—41% lower message volume, 35% faster: NAS SP @ 64 procs

• Split loops into “local-only” and “off-processor” loops

—10% fewer Dcache misses, 9% faster: NAS SP @ 64 procs

• Processor set constraints on communication sets

—12% fewer Icache misses, 7% faster: NAS SP @ 64 procs

PACT 2002 Best student paper (with Daniel Chavarria-Miranda)

Page 29: Taming High Performance Computing with Compiler Technology

29

Experimental Evaluation

• NAS SP & BT benchmarks from NASA Ames
—use ADI to solve the Navier-Stokes equation in 3D
—forward & backward line sweeps on each dimension, each timestep

• Compare four variants
—MPI hand-coded multipartitioning (NASA)
—dHPF: multipartitioned
—dHPF: 2D partitioning, coarse-grain pipelining
—PGI’s pghpf: 1D partitioning with transpose

• Platform
—SGI Origin 2000: 128 250 MHz procs.
—SGI compilers + SGI MPI

Page 30: Taming High Performance Computing with Compiler Technology

30

Efficiency for NAS SP (102³ ‘B’ size)

> 2x multipartitioning comm. volume

similar comm. volume, more serialization

Page 31: Taming High Performance Computing with Compiler Technology

31

Efficiency for NAS BT (102³ ‘B’ size)

> 2x multipartitioning comm. volume

Platform: SGI Origin 2000

Page 32: Taming High Performance Computing with Compiler Technology

32

NAS BT Parallelizations

[Execution traces for NAS BT Class 'A' - 16 processors, SGI Origin 2000: hand-coded 3D multipartitioning vs. compiler-generated 3D multipartitioning]

Page 33: Taming High Performance Computing with Compiler Technology

33

Observations

• High performance requires perfection

—parallelism and load-balance

—communication frequency

—communication volume

—scalar performance

• Data-parallel compiler technology can

—ease the programming burden

—yield near hand-coded performance

Page 34: Taming High Performance Computing with Compiler Technology

34

Data-parallel Related Work

• Linear equations/set-based compilation

—[Pugh et al; Ancourt et al; Amarasinghe & Lam]

• Commercial HPF compilers

—xlHPF, pghpf, xHPF

• HPF/JA

—14 Teraflops on a code for the Earth Simulator

• Lots of research compiler efforts

—e.g. Polaris, CAPTOOLS

None support partially-replicated computation
None support multipartitioning
None achieve linear scaling on tightly-coupled codes

Page 35: Taming High Performance Computing with Compiler Technology

35

Outline

• Motivation

• Compiler technology for HPC

— Data-parallel programming systems

☛Semi-automatic synthesis of performance models

• Challenges for the future

• Other work

Page 36: Taming High Performance Computing with Compiler Technology

36

Why Performance Modeling?

• Insight into applications

—barriers to scalability

—insight into optimizations

• Mapping applications to systems

—Grid resource selection & scheduling

—intelligent run-time adaptation

• Workload-based design of future systems

Page 37: Taming High Performance Computing with Compiler Technology

37

Modeling Challenges

• Performance depends on:

—architecture specific factors

—application characteristics

—input data parameters

• Difficult to model execution time directly

• Collecting data at scale is expensive

Page 38: Taming High Performance Computing with Compiler Technology

38

Approach

Separate contribution of application characteristics

• Measure the application-specific factors

—static analysis

—dynamic analysis

• Construct scalable models

• Explore interactions with hardware

Use binary analysis and instrumentation for language and programming model independence

[Marin & Mellor-Crummey SIGMETRICS 04]

Page 39: Taming High Performance Computing with Compiler Technology

39

Toolkit Design Overview

[Toolkit diagram:
—Static analysis: a binary analyzer extracts the control flow graph, loop nesting structure, and basic-block (BB) instruction mix from the object code
—Dynamic analysis: a binary instrumenter produces instrumented code; executing it collects BB counts, communication volume & frequency, and memory reuse distance
—Post processing: a post-processing tool combines these into an architecture-neutral model; a scheduler, driven by an architecture description, produces a performance prediction for the target architecture]

Page 40: Taming High Performance Computing with Compiler Technology

40

Building Scalable Models

• Collect data from multiple runs

—n+1 runs to compute a model of degree n

• Approximation function:

F(X) = c_n*B_n(X) + c_(n-1)*B_(n-1)(X) + … + c_0*B_0(X)

• A set of basis functions

• Include constraints

• Goal: determine coefficients

Use quadratic programming
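A simplified sketch of the coefficient-determination step (Python/numpy): the toolkit poses this as a constrained quadratic program, which this sketch replaces with plain least squares over a monomial basis, using synthetic data generated from the degree-2 model shown on the following slides.

# Fit F(X) = sum_k c_k * B_k(X) with B_k(X) = X**k to (problem size, count)
# samples from multiple runs, one fit per candidate degree. Synthetic data;
# plain least squares stands in for the constrained quadratic program.

import numpy as np

X = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])   # problem sizes (synthetic)
Y = 482 * X**2 + 1446 * X + 964                      # counts from the degree-2 model

def fit(degree):
    A = np.vander(X, degree + 1, increasing=True)    # columns are B_k(X) = X**k
    coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
    rel_err = np.linalg.norm(A @ coeffs - Y) / np.linalg.norm(Y)
    return coeffs, rel_err

for d in range(3):                                    # n+1 runs suffice for degree n
    c, err = fit(d)
    print(f"degree {d}: coefficients {np.round(c, 1)}, relative error {err:.3f}")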

Page 41: Taming High Performance Computing with Compiler Technology

41

Execution Frequency Modeling Example

[Plot “Execution Frequency Model”: execution frequency vs. problem size, showing the collected data points, with a table of the measured (problem size X, Count) samples]

Page 42: Taming High Performance Computing with Compiler Technology

42

Execution Frequency Modeling Example

[Plot “Execution Frequency Model”: collected data with the degree-0 model overlaid]

Model degree 0: Y = 41416, Err = 131%

Page 43: Taming High Performance Computing with Compiler Technology

43

Execution Frequency Modeling Example

[Plot “Execution Frequency Model”: collected data with the degree-0 and degree-1 models overlaid]

Model degree 0: Y = 41416, Err = 131%
Model degree 1: Y = 16776*X - 42366, Err = 60.4%

Page 44: Taming High Performance Computing with Compiler Technology

44

Execution Frequency Modeling Example

[Plot “Execution Frequency Model”: collected data with the degree-0, 1, and 2 models overlaid]

Model degree 0: Y = 41416, Err = 131%
Model degree 1: Y = 16776*X - 42366, Err = 60.4%
Model degree 2: Y = 482*X^2 + 1446*X + 964, Err = 0%

Page 45: Taming High Performance Computing with Compiler Technology

45

Predict Schedule Latency for an Architecture

• Input:

—basic block and edge execution frequency

• Methodology:

—recover executed paths

—SPARC instructions ➔ generic RISC

—instantiate scheduler for architecture

—construct schedule for executed paths

—determine inefficiencies

Page 46: Taming High Performance Computing with Compiler Technology

46

Toolkit Design Overview

[Toolkit design overview diagram, repeated from Page 39: static analysis, dynamic analysis, and post processing feeding the scheduler and architecture description to produce a performance prediction for the target architecture]

Page 47: Taming High Performance Computing with Compiler Technology

47

Memory Reuse Distance

• MRD: # unique data blocks referenced since the target block was last accessed

Example trace (a small sketch of this computation appears below):

  reference   memory block   MRD
  I1          A              ∞
  I2          B              ∞
  I3          A              1
  I2          C              ∞
  I3          A              1
  I2          B              2
  I3          B              0

• I1: 1 cold miss
• I2: 2 cold misses, 1 @ distance 2
• I3: 1 @ distance 0, 2 @ distance 1
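A small Python sketch that reproduces the MRD column above; the measurement tool tracks this on instrumented binaries with an efficient data structure, whereas this naive version simply replays the example trace.

# Reuse distance = number of distinct blocks referenced since the target
# block was last accessed; cold misses get distance infinity.

import math

def reuse_distances(blocks):
    stack, out = [], []                # most recently used block kept at the end
    for b in blocks:
        if b in stack:
            pos = stack.index(b)
            out.append(len(stack) - pos - 1)   # distinct blocks touched since
            stack.pop(pos)
        else:
            out.append(math.inf)               # cold miss
        stack.append(b)
    return out

trace = [("I1", "A"), ("I2", "B"), ("I3", "A"), ("I2", "C"),
         ("I3", "A"), ("I2", "B"), ("I3", "B")]
dists = reuse_distances([blk for _, blk in trace])
for (ref, blk), d in zip(trace, dists):
    print(f"{ref} touches {blk}: MRD = {d}")   # matches the table above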

Page 48: Taming High Performance Computing with Compiler Technology

48

Memory reuse distance

Page 49: Taming High Performance Computing with Compiler Technology

49

Modeling Memory Reuse Distance

• More complex than execution frequency
—cold misses
—histogram of reuse distances
– number of bins not constant

• Average reuse distance is misleading
—1 access with distance 10,000
—3 accesses with distance 0
—cache has 1024 blocks
—the average distance is 2,500 blocks (larger than the cache), yet 3 of the 4 accesses hit

Page 50: Taming High Performance Computing with Compiler Technology

50

Modeling Memory Reuse Distance

[Histogram: normalized frequency vs. reuse distance, e.g. bars of 50%, 30%, and 20% at reuse distances 2, 13, and 40]

Page 51: Taming High Performance Computing with Compiler Technology

51

Modeling Memory Reuse Distance

Page 52: Taming High Performance Computing with Compiler Technology

52

Predict Number of Cache Misses

• Instantiate model for problem size 100

[Plots comparing predicted and measured cache behavior for problem size 100; annotations: 74%, 96%]
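The translation from a reuse-distance histogram to a miss prediction can be sketched as follows (Python; this assumes a fully associative LRU cache with C blocks, under which a reference misses exactly when its reuse distance is at least C). The Page 49 example shows why the 2,500-block average is misleading: only 1 of the 4 accesses misses a 1024-block cache.

# Predicted misses from a reuse-distance histogram for a fully associative
# LRU cache with `cache_blocks` blocks (cold misses have infinite distance).

import math

def predicted_misses(histogram, cache_blocks):
    """histogram maps reuse distance -> number of references at that distance."""
    return sum(n for dist, n in histogram.items() if dist >= cache_blocks)

# Page 49 example: 1 access at distance 10,000; 3 at distance 0; no cold misses.
hist = {10_000: 1, 0: 3, math.inf: 0}
print(predicted_misses(hist, cache_blocks=1024))    # -> 1 miss, not 4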

Page 53: Taming High Performance Computing with Compiler Technology

53

Prediction: NAS BT 3.0 Mem Hier Utilization

[Plot “NAS BT 3.0 Memory Hierarchy Utilization”: miss count / cell / time step (0 to 300) vs. mesh size (0 to 200), with curves for L1 measured, L1 predicted, L2 measured (×10), L2 predicted (×10), TLB measured (×10), TLB predicted (×10)]

Page 54: Taming High Performance Computing with Compiler Technology

54

Prediction: NAS BT 3.0 Time on SGI Origin

[Plot “NAS BT 3.0 from SPARC to SGI Origin”: cycles / cell / time step vs. mesh size (0 to 200), with curves for measured time, scheduler latency, L1 miss penalty, L2 miss penalty, TLB miss penalty, and predicted time]

Page 55: Taming High Performance Computing with Compiler Technology

55

Open Performance Modeling Issues

• Short term

—Better modeling of memory subsystem
– # outstanding loads to accurately predict memory latency

—Explore modeling of irregular applications

• Long term

—Model parallel applications
– Present modeling applies between synchronization points

– Combine with manually constructed parallel models

– Semi-automatically recover parallel trends

—Understand dynamic parallelism

Page 56: Taming High Performance Computing with Compiler Technology

56

Modeling Related Work

• Reuse distance
—Cache utilization [Beyls & D’Hollander]
—Investigating optimizations [Ding et al.]

• Program instrumentation
—EEL, QPT [Ball, Larus, Schnarr]

• Scalable analytic models
—[Vernon et al.; Hoisie et al.]

• Cross-architecture models at scale
—[Snavely et al.; Cascaval et al.]

• Simulation (trace-based and execution-driven)

None yield semi-automatically derived scalable models

Page 57: Taming High Performance Computing with Compiler Technology

57

HPC Compiler Challenges for the Future

• Programming systems for large-scale machines
—Abstraction and greater expressiveness are needed
—Potential parallelism must be readily accessible
– implicit parallelism or explicit element-wise parallelism

—Locality and latency tolerance are both critical for performance

—Dynamic self-scheduled parallelism will be necessary

—Failure will occur and must be expected and handled

• Support for “self-tuning software” for complex architectures

• Compiler-based tools
—Debugging and performance analysis of large-scale software on dynamic systems is a major open problem

• Insight into hardware design
—Understanding impact of proposed designs on whole programs

Page 58: Taming High Performance Computing with Compiler Technology

58

Past Work

• Multiprocessor synchronization

—locks, synchronous barriers [ASPLOS89, TOCS91]

—reader-writer synchronization [PPOPP91]

—fuzzy barriers [IJPP94]

• Parallel debugging

—execution replay [JPDC90, TOC87]

—software instruction counter [ASPLOS89]

—detecting data races [WPDD93, SC91, SC90]

• Parallel programming environments

—Parascope [PIEEE 93], Dsystem [TPDT94]

• Parallel applications

—molecular dynamics [JCC92]

Page 59: Taming High Performance Computing with Compiler Technology

59

Ongoing Work

• Global address space parallel languages

—Co-array Fortran [LCPC03]

• Performance analysis

— [TJS02, LACSI01, ICS01, SIGMETRICS01]

• Improving node performance

—irregular mesh and particle codes [ICS99, IJPP00]

—sparse matrices [LACSI02, IJHPCA04]

—multigrid [ICS01]

—dense matrices [LACSI03]

• Grid computing [IJHPCA01]

• Library-based domain languages [JPDC01]