CR18: Advanced Compilers
L08: Memory & Power
Tomofumi Yuki

Upload: janis-newman, posted on 29-Jan-2016

TRANSCRIPT

Page 1: CR18: Advanced Compilers L08: Memory & Power Tomofumi Yuki

CR18: Advanced Compilers

L08: Memory & Power

Tomofumi Yuki

Page 2:

Memory Expansion

Recall Array Dataflow Analysis
- start from loops
- get value-based dependences
- corresponds to Alpha = no notion of memory

It is sometimes called Full Array Expansion
- explicit dependences with single assignment
- full parallelism exposed

Page 3:

Memory vs Parallelism

More parallelism requires more memory
- obvious example: scalar accumulation

One approach: ignore the problem by using memory-based dependences

Alternatively, we can try to find a memory allocation afterwards

Page 4:

Memory Allocation

Given a schedule:
- Memory Reuse Analysis [1996]
- Lefebvre-Feautrier [1998]
- Quilleré-Rajopadhye [2000]
- Lattice-Based [2005]

For a set of schedules:
- Universal Occupancy Vectors [1998]
- Affine Universal Occupancy Vectors [2001]
- Quasi-Universal Occupancy Vectors [2013]

Page 5:

Occupancy Vectors

Main Concept: a vector (in the iteration space) that gives another iteration that can safely overwrite

Universal OV: an OV that is legal for any schedule
- affine and quasi- variants restrict the universe to a smaller subset

Page 6:

Universal Occupancy Vectors

Only for uniform dependences
- all iterations have the same dependence pattern
- large enough domain (no thin strips)

Key Idea: Transitivity
- some iteration z can overwrite z' if z depends on all uses of z' (possibly transitively)

Page 7:

UOV Example

Find a UOV for the following

[figure: iteration space (i, j)]

Page 8:

UOV Example

Find a UOV for the following
- is [1,1] a valid UOV?
- how does it translate to a memory mapping?

[figure: iteration space (i, j)]

Page 9:

UOV Example

Find a UOV for the following
- how about [1,0]?

[figure: iteration space (i, j)]

Page 10:

UOV Example

Find a UOV for the following

[figure: iteration space (i, j)]

Page 11:

UOV Example

Alternative Formulation
- as an intersection of transitive closures

[figure: iteration space (i, j)]

Page 12:

Affine UOV Example

Restrict to affine schedules, but allow affine dependences

[figure: iteration space (i, j)]

Page 13:

Relevance of UOVs

A UOV allocates a (d-1)-dimensional array for a d-dimensional space

Does this sound like a problem?

What can you say about programs with only uniform dependences?

How does this relate to tiling?

Page 14:

Memory Allocation/Contraction

We are given an affine schedule θ
- per statement
- possibly multi-dimensional

Problem: find affine pseudo-projections
- affine function + modulo factors per statement
- usually minimizing the memory usage

Page 15:

Pseudo Projection

Assume lex. order as the schedule
- what is a valid OV?

[figure: iteration space (i, j)]

Page 16:

Pseudo Projection

Assume lex. order as the schedule
- what is a valid OV? [0,2], which translates to:

[figure: iteration space (i, j)]

for i
  for j
    A[i%2][j] = foo(A[(i-1)%2][j], A[i%2][j-1]);

Page 17:

Allocation vs Contraction

Most programs have many more statements than arrays

Memory allocation techniques:
- map each statement to its own array
- try to merge arrays afterwards

Array contraction techniques:
- keep the original statement-to-array mapping

Little difference in the theory behind them

Page 18:

Liveness of Values

Central analysis in memory allocation
- liveness analysis in register allocation
- called by different names

Given: a value computed at S(z), used by T(z')
- we cannot overwrite the value of S(z), written at θ(S,z), until θ(T,z'), for all T, z'

Page 19:

Computing the Liveness

How do we compute the liveness?
- θ(i,j) = i
- θ(i,j) = i+j

[figure: iteration space (i, j)]

Page 20:

Lefebvre-Feautrier

How to find the allocation? θ(i,j) = i
1. Start with a scalar
2. Expand in a dimension
3. Use the max reuse distance as the modulo factor

[figure: iteration space (i, j)]

Page 21:

Lefebvre-Feautrier

Alternative Description
1. Start with the full array
2. Project along a dimension
3. Compute the modulo factor

[figure: iteration space (i, j)]

Page 22:

Quilleré-Rajopadhye

Based on non-canonic projections

Main Result: Optimality
- for a d-D space, if you find x independent projections, what can you say about memory usage?

Page 23:

Lattice-Based Allocation

Different formulation using lattices
- consider some basis of an integer lattice

[figure: iteration space (i, j) with lattice points]

Page 24:


Page 25:


Page 26:

Lattice-Based Allocation

Lattices ≈ Occupancy Vectors

Conflict Set
- values that cannot be mapped to the same memory location

Find the smallest lattice that is large enough to intersect the conflict set only at its base
- enumeration of the space using HNF (Hermite Normal Form)

Page 27:

Page 28:

Energy-Aware Compilation

Power Wall

Page 29:

Power Density
- led to multi-core

Saving Energy is important
- barrier for Exa-Scale computing
- battery lifetime of laptops

Compiler Optimization has focused on speed
- is there anything compilers can do for energy?
- speed is still important

Page 30:

Starting Hypothesis

Energy is Power consumed over Time: E = P × T
- P: power consumption
- E: energy consumption
- T: execution time

Faster execution time = lower energy consumption

Hypothesis: optimizing for speed also optimizes energy


Page 31:

Single Processor Case

Two main categories

Purely Program Transformations
- efficient use of data cache
- Energy Aware Compilation framework

Dynamic Voltage and Frequency Scaling
- profile based
- loop transformation + DVFS

Page 32:

Efficient use of data cache

HW with configurable cache line size
- trade-off: larger CLS => better spatial locality, higher interference

Main Contribution: a model to maximize hit ratio
- configurable CLS leads to an energy trade-off

In GP processors, data locality optimization ≈ energy optimization of cache

[D'Alberto et al., 2001]

Page 33:

Energy Aware Compilation

Compiler framework with energy in mind
- based on predicting power consumption from high-level source code

Energy-Aware Tiling
- optimal tiling strategy for speed != for energy
- key: tiling adds instructions

Main Weakness
- improvement is relatively small (~10%)
- energy is traded with speed

[Kadayif et al., 2002]

Page 34:

Results by Kadayif et al.

Increase in energy/execution cycles when optimized for the other
- energy delay product would not change much

         fir    conv   lms    real   biquad  complex  mxm    vpenta
Energy   4.1%   7.7%   6.8%   3.9%   2.0%    8.8%     5.9%   7.3%
Cycle    5.9%   8.7%   7.2%   2.9%   2.3%    7.6%     9.2%   6.8%

Page 35:

HW for Further Optimization

Dynamic Voltage and Frequency Scaling
- power consumption model for CMOS:

P = α · C · V² · f
- V: supply voltage
- f: frequency
- α: activity rate
- C: switched capacitance

Voltage is the obvious target
- high frequency requires high voltage
- quadratic energy savings with reduced frequency

Page 36:

DVFS : Main Idea

Identify non-compute-intensive stages
- frequency/voltage can be reduced without influencing speed
- the processor is under-utilized

DVFS states are coarse grained
- ~10 different frequency/voltage configurations

State transitions are not free
- 100s of cycles
- extra energy consumed

Page 37:

DVFS : Single Processor

Profile Based [Hsu and Kremer 2003, Hsu and Feng 2005]
- profile to identify opportunities
- compile-time vs. run-time
- limited by available opportunities

Loop Transformation [Ghodrat and Givargis 2009]
- first, optimize for speed
- then convert speedup to energy savings
- transformation to expose opportunities

Page 38:

DVFS : Single Processor

Task-Based Programs

Main Idea: Decoupled Access/Execute [Jimborean et al. 2014]
- compiler transformation to split into tasks
- one that does memory Accesses to fetch data
- another that does Execute to compute

Apply DVFS
- low frequency for Access
- high frequency for Execute

Page 39:

Single Processor : Summary

Purely software-based optimization
- no significant gains over optimizing for speed
- the hypothesis holds in this case

DVFS-based approaches
- HW for energy savings exposed to software
- identify when the processor is not fully utilized
- HW support breaks the hypothesis

Page 40:

Across Processors

Parallelization is necessary to utilize modern architectures

How does parallelism affect energy?
- Amdahl's Law for Energy
- opportunities in parallel programs

Page 41:

Static Power

New Term in the Power Model

P = V · I_leak + α · C · V² · f
- first term: static power (I_leak: leakage current)
- second term: dynamic power

Some power is consumed even when idle
- DVFS has less effect
- Static Power is reaching 50% of the total power

Page 42:

Amdahl’s Law for Energy

Simple model of energy and parallelism
- processors have DVFS
- simple, but more complicated than the original

Speed-up/energy trade-off analysis [Cho and Melhem 2008]

Model terms: sequential dynamic + parallel dynamic + static
- s: sequential fraction
- p: parallel fraction
- N: number of processors
- λ: static power
- y: power consumption as a function of frequency

Page 43:

Illustrating example from the paper

[figure: energy vs. frequency]

Page 44:

When Static Power is 50%

[figure: energy vs. frequency]

Page 45:

Static Power dominates

Static Power is significant
- increases as N increases
- excessive processors are bad

With current technology (high static power and increasing core counts), running as fast as possible is a good way to save energy

Page 46:

Generalizing a bit Further

Analysis based on a high-level energy model
- emphasis on power breakdown
- find when "race-to-sleep" is the best
- survey the power breakdown of recent machines

Goal: confirm that sophisticated use of DVFS by compilers is not likely to help much
- e.g., analysis/transformation to find/expose a "sweet spot" for trading speed for energy

Page 47:

Power Breakdown

Dynamic (Pd): consumed when bits flip
- quadratic savings as voltage scales

Static (Ps): leaked while current is flowing
- linear savings as voltage scales

Constant (Pc): everything else
- e.g., memory, motherboard, disk, network card, power supply, cooling, ...
- little or no effect from voltage scaling

Page 48:

Influence on Execution Time

Voltage and Frequency are linearly related
- slope is less than 1
- i.e., scale voltage by half, and frequency drops by less than half

Simplifying Assumptions
- frequency change directly influences exec. time: scale frequency by x, time becomes 1/x
- fully flexible (continuous) scaling; in practice there is only a small set of discrete states

Page 49:

Ratio is the Key: Pd : Ps : Pc

Case 1: Dynamic dominates -> Energy: slower the better
Case 2: Static dominates -> Energy: no harm, but no gain
Case 3: Constant dominates -> Energy: faster the better

(in each case, consider how Power and Time change as frequency scales)

Page 50:

When do we have Case 3?

Static power is now more than dynamic power
- power gating doesn't help while computing

Assume Pd = Ps
- 50% of CPU power is due to leakage
- roughly matches 45nm technology
- further shrink = even more leakage

The borderline is when Pd = Ps = Pc
- we have Case 3 when Pc is larger than Pd = Ps

Page 51:

Extensions to The Model

Impact on Execution Time
- may not be directly proportional to frequency
- shifts the borderline in favor of DVFS
- larger Ps and/or Pc required for Case 3

Parallelism
- no influence on the result
- CPU power is even less significant than in the 1-core case
- power budget for a chip is shared (multi-core)
- network cost is added (distributed)

Page 52:

Do we have Case 3?

Survey of machines and the significance of Pc

Based on:
- published power budgets (TDP)
- published power measures
- not on detailed/individual measurements

Conservative Assumptions
- use an upper bound for CPU
- use a lower bound for constant powers
- assume high PSU efficiency

Page 53:

Pc in Current Machines

Sources of Constant Power
- Stand-By Memory (1W/1GB): memory cannot go idle while the CPU is working
- Power Supply Unit (10-20% loss): transforming AC to DC
- Motherboard (6W)
- Cooling Fan (10-15W): fully active when the CPU is working

Desktop Processor TDP ranges from 40-90W
- up to 130W for large core counts (8 or 16)

Page 54:

Server and Desktop Machines: Methodology

Compute a lower bound of Pc
- does it exceed 33% of total system power?
- then Case 3 holds even if the rest were all consumed by the processor

System load
- Desktop: compute-intensive benchmarks
- Server: server workloads (not as compute-intensive)

Page 55:

Desktop and Server Machines


Page 56:

Cray Supercomputers

Methodology
- let Pd + Ps be the sum of processor TDPs
- let Pc be the sum of: PSU loss (5%), cooling (10%), memory (1W/1GB)
- check if Pc exceeds Pd = Ps

Two cases for memory configuration (min/max)

Page 57:

Cray Supercomputers

[chart: power breakdown (CPU-dynamic, CPU-static, Memory, PSU+Cooling, Other) as % of total for XT5, XT6, and XE6, each with min and max memory configurations]

Page 58:


Page 59:


Page 60:

DVFS for Memory

Still in the research stage (since ~2010)
- same principle applied to memory
- quadratic component in power w.r.t. voltage: 25% quadratic, 75% linear

The model can be adapted:
- Pd becomes Pq (dynamic -> quadratic)
- Ps becomes Pl (static -> linear)
- the same story, but with Pq : Pl : Pc

Page 61:

Influence on “race-to-sleep”

Methodology
- move memory power from Pc to Pq and Pl: 25% to Pq and 75% to Pl
- Pc becomes 15% of total power for Server/Cray: "race-to-sleep" may not be the best anymore
- Pc remains around 30% for desktop

Vary the Pq:Pl ratio to find when "race-to-sleep" is the winner again
- leakage is expected to keep increasing

Page 62:

When "Race to Sleep" is Optimal

When the derivative of energy w.r.t. frequency scaling is > 0: dE/dF > 0

[figure: dE/dF vs. linearly scaling fraction Pl / (Pq + Pl)]

Page 63:

Summary and Conclusion

Diminishing returns of DVFS
- the main reason is leakage power
- confirmed by a high-level energy model
- "race-to-sleep" seems to be the way to go
- memory DVFS won't change the big picture

Compilers can continue to focus on speed
- no significant gain in energy efficiency by sacrificing speed

Page 64:

Balancing Computation and I/O

DVFS can improve energy efficiency when speed is not sacrificed

Bring the program to a compute-I/O balanced state
- if it's memory-bound, slow down the CPU
- if it's compute-bound, slow down the memory

Still maximizing hardware utilization
- but by lowering the hardware capability
- current hardware (e.g., Intel Turbo Boost) and/or the OS do this for the processor

Page 65:

Page 66:

The Punch Line Method

How to punch your audience
- how to attract your audience

Make your talk more effective
- learned from Michelle Strout (Colorado State University)
- applicable to any talk

[figure: audience rating (poor / average / good / excellent) of a Normal Talk vs. a Punch Line Talk]

Page 67:

The Punch Line

The key cool idea in your paper
- the key insight

It is not the key contribution!
- "X% better than Y"
- "does well on all benchmarks"

Examples:
- "... because of HW prefetching"
- "... further improve locality after reaching compute-bound"

Page 68:

Typical Conference Audience

Many things to do
- check emails
- browse websites
- finish their own slides

Attention Level (made-up numbers)
- ~3 minutes: 90%
- ~5 minutes: 60%
- 5+ minutes: 30%
- conclusion: 70%

punch here! push these numbers up!

Page 69:

Typical (Boring) Talk

1. Introduction
2. Motivation
3. Background
4. Approach
5. Results
6. Discussion
7. Conclusion

Page 70:

Punch Line Talk

Two Talks in One

5-minute talk: "the punch"
- introduction/motivation
- key idea
- shortest path to the punch

(X-5)-minute talk
- add some background
- elaborate on the approach
- ...

Page 71:

Pitfalls of Beamer

Beamer != bad slides
- but it is an easy path to one

Checklist for good slides
- no full sentences
- LARGE font size
- few equations
- many figures
- !paper structure

Beamer is not the best tool to encourage these