processor architectures and program mapping

Processor Architectures and Program Mapping TU/e 5kk10 Henk Corporaal Jef van Meerbergen Bart Mesman Exploiting ILP part 2: code generation


Post on 31-Dec-2015


TRANSCRIPT

Page 1: Processor Architectures and Program Mapping

Processor Architectures and Program Mapping

TU/e 5kk10

Henk Corporaal
Jef van Meerbergen
Bart Mesman

Exploiting ILP, part 2: code generation

Page 2: Processor Architectures and Program Mapping

04/19/23 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
• Hands-on

Page 3: Processor Architectures and Program Mapping

Compiler basics

• Overview
  – Compiler trajectory / structure / passes
  – Control Flow Graph (CFG)
  – Mapping and Scheduling
  – Basic block list scheduling
  – Extended scheduling scope
  – Loop scheduling

Page 4: Processor Architectures and Program Mapping

Compiler basics: trajectory

[Figure: Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program. The compiler also produces error messages; the loader/linker also takes library code.]

Page 5: Processor Architectures and Program Mapping

Compiler basics: structure / passes

Source code
→ Lexical analyzer: token generation
→ Parsing: check syntax, check semantic, parse tree generation
→ Intermediate code
→ Code optimization: data flow analysis, local optimizations, global optimizations
→ Code generation: code selection, peephole optimizations
→ Register allocation: making interference graph, graph coloring, spill code insertion, caller / callee save and restore code
→ Sequential code
→ Scheduling and allocation: exploiting ILP
→ Object code

Page 6: Processor Architectures and Program Mapping

Compiler basics: structure
Simple compilation example

position := initial + rate * 60

  ↓ Lexical analyzer

id1 := id2 + id3 * 60

  ↓ Syntax analyzer

[Parse tree: := with children id1 and +; + with children id2 and *; * with children id3 and 60]

  ↓ Intermediate code generator

temp1 := intoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

  ↓ Code optimizer

temp1 := id3 * 60.0
id1 := id2 + temp1

  ↓ Code generator

movf id3, r2
mulf #60, r2, r2
movf id2, r1
addf r2, r1
movf r1, id1

Page 7: Processor Architectures and Program Mapping

Compiler basics: Control flow graph (CFG)

C input code:

  if (a > b) { r = a % b; } else { r = b % a; }

CFG:

  1: sub t1, a, b
     bgz t1, 2, 3
  2: rem r, a, b
     goto 4
  3: rem r, b, a
     goto 4
  4: ...

A program is a collection of functions; each function is a collection of basic blocks (BBs); each BB contains a set of instructions; each instruction consists of several transports, ...

Page 8: Processor Architectures and Program Mapping

Mapping / Scheduling: placing operations in space and time

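d = a * b;
e = a + d;
f = 2 * b + d;
r = f - e;
x = z + y;

[Figure: Data Dependence Graph (DDG) of these statements. Inputs a, b, 2, z and y feed two multiplications (producing d and 2 * b), three additions (producing e, f and x) and one subtraction (producing r).]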

Page 9: Processor Architectures and Program Mapping

How to map these operations?

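[Figure: the DDG of the five statements, with inputs a, b, 2, z, y and results d, e, f, r, x]

Architecture constraints:
• One Function Unit
• All operations single cycle latency

[Figure: schedule on the single function unit, one operation per cycle in cycles 1 to 6: *, *, +, +, -, +]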

Page 10: Processor Architectures and Program Mapping

How to map these operations?

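[Figure: the DDG of the five statements, with inputs a, b, 2, z, y and results d, e, f, r, x]

Architecture constraints:
• One Add-sub and one Mul unit
• All operations single cycle latency

[Figure: schedule table with one column per unit (Mul, Add-sub) and one row per cycle (1 to 6); the two multiplications are placed on the Mul unit, the additions and the subtraction on the Add-sub unit]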

Page 11: Processor Architectures and Program Mapping

There are many mapping solutions

[Figure: Pareto curve (solution space). Mapping solutions plotted as points, with execution time (T execution) on the vertical axis and cost on the horizontal axis.]
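The idea of a Pareto curve can be made concrete with a small sketch. It assumes each mapping solution is summarized by a (cost, execution time) pair where lower is better on both axes; the data points are hypothetical and not taken from the slides.

```python
def pareto_front(points):
    """Return the points that are not dominated by any other point.

    Each point is (cost, exec_time); lower is better on both axes.
    """
    front = []
    best_time = float("inf")
    # Walk the points by increasing cost: a point belongs to the front only
    # if it is strictly faster than every cheaper-or-equal point seen so far.
    for cost, time in sorted(points):
        if time < best_time:
            front.append((cost, time))
            best_time = time
    return front


# Hypothetical (cost, execution time) pairs, for illustration only.
solutions = [(1, 12), (2, 9), (2, 11), (3, 9), (4, 5), (5, 6), (6, 4)]
print(pareto_front(solutions))
```

Every point that is not on the returned front is both no cheaper and no faster than some point on it, so a designer only needs to choose among the front.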

Page 12: Processor Architectures and Program Mapping

Basic Block Scheduling

• Make a dependence graph

• Determine minimal length

• Determine ASAP, ALAP, and slack of each operation

• Place each operation in first cycle with sufficient resources

Note:
– Scheduling order is sequential
– Priority is determined by the heuristic used, e.g. slack
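As a minimal sketch of the ASAP/ALAP/slack step, the code below analyses the DDG of the five-statement example (d = a * b; e = a + d; f = 2 * b + d; r = f - e; x = z + y). The operation labels such as `mul_d` are illustrative names, and the sketch assumes unit latency and a predecessor table that is already listed in topological order.

```python
# Predecessors of each operation in the DDG (labels are illustrative).
preds = {
    "mul_d": [],                    # d = a * b
    "mul_2b": [],                   # 2 * b
    "add_e": ["mul_d"],             # e = a + d
    "add_f": ["mul_2b", "mul_d"],   # f = 2 * b + d
    "sub_r": ["add_f", "add_e"],    # r = f - e
    "add_x": [],                    # x = z + y
}


def asap_alap(preds, latency=1):
    """Return (asap, alap, slack) dictionaries; cycles are numbered from 1."""
    order = list(preds)  # assumed to be a topological order
    asap = {}
    for v in order:
        asap[v] = max((asap[u] + latency for u in preds[v]), default=1)
    length = max(asap.values())  # minimal schedule length
    succs = {v: [w for w in preds if v in preds[w]] for v in preds}
    alap = {}
    for v in reversed(order):
        alap[v] = min((alap[w] - latency for w in succs[v]), default=length)
    slack = {v: alap[v] - asap[v] for v in preds}
    return asap, alap, slack


asap, alap, slack = asap_alap(preds)
for v in preds:
    print(f"{v}: <{asap[v]},{alap[v]}> slack {slack[v]}")
```

Only `add_x` has slack here; all other operations lie on a critical path of three cycles, so a slack-based priority would place them first.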

Page 13: Processor Architectures and Program Mapping

Basic Block Scheduling

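[Figure: dependence graph of a basic block with the operations LD, LD, MUL, NEG, ADD, ADD, ADD and SUB on inputs A, B and C, producing X, y and z. Each operation is annotated with <ASAP cycle, ALAP cycle>: <1,1>, <2,2>, <3,3> and <4,4> for operations without slack, and <1,3>, <1,4>, <2,3> and <2,4> for operations with slack. Slack = ALAP cycle - ASAP cycle.]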

Page 14: Processor Architectures and Program Mapping

Cycle based list scheduling

proc Schedule(DDG = (V,E))
beginproc
  ready = { v | ¬∃ (u,v) ∈ E }
  ready’ = ready
  sched = ∅
  current_cycle = 0
  while sched ≠ V do
    for each v ∈ ready’ do
      if ¬ResourceConfl(v, current_cycle, sched) then
        cycle(v) = current_cycle
        sched = sched ∪ {v}
      endif
    endfor
    current_cycle = current_cycle + 1
    ready = { v | v ∉ sched ∧ ∀ (u,v) ∈ E, u ∈ sched }
    ready’ = { v | v ∈ ready ∧ ∀ (u,v) ∈ E, cycle(u) + delay(u,v) ≤ current_cycle }
  endwhile
endproc
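The pseudocode above can be sketched in Python as follows. This is an illustrative version, not the course's reference implementation: it assumes unit latency, models resources as a number of free slots per unit type, and uses slack as the priority. The operation labels and the slack values belong to the five-statement example (d, e, f, r, x) and are assumptions of this sketch.

```python
def list_schedule(preds, unit_of, units, priority, latency=1):
    """Cycle-based list scheduling; returns {op: cycle}, cycles start at 0."""
    cycle = {}
    current = 0
    while len(cycle) < len(preds):
        free = dict(units)  # free slots per unit type in this cycle
        # ready': unscheduled operations whose predecessors have all finished
        ready = [v for v in preds if v not in cycle
                 and all(u in cycle and cycle[u] + latency <= current
                         for u in preds[v])]
        for v in sorted(ready, key=priority):  # lowest slack first
            if free[unit_of[v]] > 0:           # no resource conflict
                free[unit_of[v]] -= 1
                cycle[v] = current
        current += 1
    return cycle


preds = {
    "mul_d": [], "mul_2b": [], "add_e": ["mul_d"],
    "add_f": ["mul_2b", "mul_d"], "sub_r": ["add_f", "add_e"], "add_x": [],
}
# Slack per operation for unit latency (only x = z + y can be delayed).
slack = {"mul_d": 0, "mul_2b": 0, "add_e": 0, "add_f": 0, "sub_r": 0, "add_x": 2}
unit_of = {"mul_d": "mul", "mul_2b": "mul", "add_e": "addsub",
           "add_f": "addsub", "sub_r": "addsub", "add_x": "addsub"}

one_fu = list_schedule(preds, {v: "fu" for v in preds}, {"fu": 1}, slack.get)
two_fu = list_schedule(preds, unit_of, {"mul": 1, "addsub": 1}, slack.get)
print("one FU:", max(one_fu.values()) + 1, "cycles")
print("Mul + Add-sub:", max(two_fu.values()) + 1, "cycles")
```

With a single function unit the six operations need six cycles; with one Mul and one Add-sub unit the dependence chain through f and r limits this sketch to four cycles.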

Page 15: Processor Architectures and Program Mapping

Extended basic block scheduling: Code Motion

A: a) add r4, r4, 4
   b) beq . . .
B: c) add r1, r1, r2
C: d) sub r1, r1, r2
D: e) st r1, 8(r4)

• Downward code motions?
  – a → B, a → C, a → D, c → D, d → D
• Upward code motions?
  – c → A, d → A, e → B, e → C, e → A

Page 16: Processor Architectures and Program Mapping

Extended Scheduling scope

Code:

  A;
  If cond Then B Else C;
  D;
  If cond Then E Else F;
  G;

CFG (Control Flow Graph): A branches to B and C, which join in D; D branches to E and F, which join in G.

Page 17: Processor Architectures and Program Mapping

Scheduling scopes

• Trace
• Superblock
• Decision tree
• Hyperblock/region

Page 18: Processor Architectures and Program Mapping

Code movement (upwards) within regions

[Figure: a region of basic blocks in which an instruction I (an add) is moved upwards from its source block to a destination block. Legend: code movement; copy needed; intermediate block; check for off-liveness.]

Page 19: Processor Architectures and Program Mapping

Extended basic block scheduling: Code Motion

• A dominates B ⇔ A is always executed before B
  – Consequently: A does not dominate B ⇒ code motion from B to A requires code duplication
• B post-dominates A ⇔ B is always executed after A
  – Consequently: B does not post-dominate A ⇒ code motion from B to A is speculative

[Figure: CFG with basic blocks A, B, C, D, E and F]

Q1: does C dominate E?

Q2: does C dominate D?

Q3: does F post-dominate D?

Q4: does D post-dominate B?
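Dominance and post-dominance can be computed with a simple iterative data-flow pass. The sketch below does not use the CFG of this slide, whose edges are not recoverable from the transcript; instead it assumes the CFG of the earlier code example (A; If cond Then B Else C; D; If cond Then E Else F; G). Post-dominators are obtained by running the same routine on the reversed graph.

```python
def dominators(succ, entry):
    """Iterative data-flow: dom(n) = {n} | intersection of dom(p), p in pred(n)."""
    nodes = set(succ)
    pred = {n: [p for p in succ if n in succ[p]] for n in nodes}
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = {n} | set.intersection(*(dom[p] for p in pred[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom


def reverse(succ):
    """Reverse all edges, so that dominators() yields post-dominators."""
    return {n: [p for p in succ if n in succ[p]] for n in succ}


# Assumed CFG: A; If cond Then B Else C; D; If cond Then E Else F; G
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}

dom = dominators(cfg, "A")
pdom = dominators(reverse(cfg), "G")
print("B dominates D:", "B" in dom["D"])        # False: moving D to B needs a copy in C
print("D post-dominates A:", "D" in pdom["A"])  # True: moving D to A is not speculative
print("B post-dominates A:", "B" in pdom["A"])  # False: moving B to A is speculative
```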

Page 20: Processor Architectures and Program Mapping

Scheduling: Loops

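Loop Optimizations:

[Figure: three control flow graphs built from blocks A, B, C and D: the original loop, the loop after loop peeling, and the loop after loop unrolling. C’ and C’’ are copies of block C.]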

Page 21: Processor Architectures and Program Mapping

Scheduling: Loops

Problems with unrolling:
• Exploits only parallelism within sets of n iterations
• Iteration start-up latency
• Code expansion

[Figure: resource utilization over time for three approaches: basic block scheduling, basic block scheduling and unrolling, and software pipelining]

Page 22: Processor Architectures and Program Mapping

Software pipelining

• Software pipelining a loop is:
  – Scheduling the loop such that iterations start before preceding iterations have finished
  Or:
  – Moving operations across the backedge

Example: y = a.x

[Figure: three schedules of the loop body LD, ML, ST]

• Sequential: LD, ML, ST one after the other; 3 cycles/iteration
• Unrolling (3 iterations per group); 5/3 cycles/iteration:

  LD
  LD ML
  LD ML ST
     ML ST
        ST

• Software pipelining: the row LD ML ST repeats as the loop kernel; 1 cycle/iteration
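The three figures on this slide follow from simple arithmetic. The sketch below assumes a loop body of `depth` single-cycle stages, enough resources to start one iteration per cycle inside an unrolled group, and no overlap between groups; the function name is illustrative.

```python
from fractions import Fraction


def cycles_per_iteration(depth, unroll):
    """A group of `unroll` overlapped iterations takes depth + unroll - 1 cycles."""
    return Fraction(depth + unroll - 1, unroll)


print(cycles_per_iteration(3, 1))    # no unrolling: 3
print(cycles_per_iteration(3, 3))    # unrolled three times: 5/3
print(float(cycles_per_iteration(3, 1000)))  # approaches 1 as the group grows
```

Software pipelining reaches the limit of 1 cycle per iteration without growing the code, because the steady-state row is emitted once as the loop kernel.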

Page 23: Processor Architectures and Program Mapping

Software pipelining (cont’d)

Basic techniques:

• Modulo scheduling (Rau, Lam)
  – list scheduling with modulo resource constraints
• Kernel recognition techniques
  – unroll the loop
  – schedule the iterations
  – identify a repeating pattern
  – Examples:
    • Perfect pipelining (Aiken and Nicolau)
    • URPR (Su, Ding and Xia)
    • Petri net pipelining (Allan)
• Enhanced pipeline scheduling (Ebcioğlu)
  – fill first cycle of iteration
  – copy this instruction over the backedge

Page 24: Processor Architectures and Program Mapping

Software pipelining: Modulo scheduling

Example: Modulo scheduling a loop

for (i = 0; i < n; i++)
    a[i+6] = 3 * a[i] - 1;

(a) Example loop

ld  r1,(r2)
mul r3,r1,3
sub r4,r3,1
st  r4,(r5)

(b) Code without loop control

Prologue   ld  r1,(r2)
           mul r3,r1,3   ld  r1,(r2)
           sub r4,r3,1   mul r3,r1,3   ld  r1,(r2)
Kernel     st  r4,(r5)   sub r4,r3,1   mul r3,r1,3   ld  r1,(r2)
Epilogue                 st  r4,(r5)   sub r4,r3,1   mul r3,r1,3
                                       st  r4,(r5)   sub r4,r3,1
                                                     st  r4,(r5)

(c) Software pipeline: each column is one iteration, each row one cycle

• Prologue fills the SW pipeline with iterations
• Epilogue drains the SW pipeline
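A short sketch can generate the prologue, kernel and epilogue rows for this example. It assumes an initiation interval of 1, one cycle per stage, and at least as many iterations as stages; the function name and the row format are illustrative.

```python
def software_pipeline(stages, n_iter):
    """Stage s of iteration i issues in cycle i + s (II = 1).

    Returns one (phase, [(iteration, op), ...]) entry per cycle.
    """
    depth = len(stages)
    rows = []
    for c in range(n_iter + depth - 1):
        ops = [(i, stages[c - i]) for i in range(n_iter) if 0 <= c - i < depth]
        if c < depth - 1:
            phase = "prologue"   # pipeline still filling
        elif c < n_iter:
            phase = "kernel"     # one op of every stage in flight
        else:
            phase = "epilogue"   # pipeline draining
        rows.append((phase, ops))
    return rows


rows = software_pipeline(["ld", "mul", "sub", "st"], n_iter=4)
for phase, ops in rows:
    print(f"{phase:9s}", "  ".join(f"{op}[{i}]" for i, op in ops))
```

With more iterations only the number of kernel rows grows; the prologue and epilogue keep their length of three rows each for a four-stage body.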

Page 25: Processor Architectures and Program Mapping

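Software pipelining: determine II, Initiation Interval

For (i=0;.....)
    A[i+6] = 3*A[i]-1

Cyclic data dependences

[Figure: dependence graph of the chain ld r1,(r2) → mul r3,r1,3 → sub r4,r3,1 → st r4,(r5). Edges are labelled (delay, distance): each pair of consecutive operations is connected by edges labelled (1,0) and (0,1), and the edge between st and ld is labelled (1,6).]

$$\mathit{cycle}(v) \ge \mathit{cycle}(u) + \mathit{delay}(u,v) - II \cdot \mathit{distance}(u,v)$$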

Page 26: Processor Architectures and Program Mapping

Modulo scheduling constraints

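MII, the minimum initiation interval, is bounded by cyclic dependences and resources:

$$\mathrm{MII} = \max\{\mathrm{ResMII}, \mathrm{RecMII}\}$$

Resources:

$$\mathrm{ResMII} = \max_{r \in \mathit{resources}} \left\lceil \frac{\mathit{used}(r)}{\mathit{available}(r)} \right\rceil$$

Cycles (for every dependence cycle $c$ through an operation $v$):

$$\mathit{cycle}(v) \ge \mathit{cycle}(v) + \sum_{e \in c} \bigl(\mathit{delay}(e) - II \cdot \mathit{distance}(e)\bigr)$$

Therefore:

$$\mathrm{RecMII} = \min\Bigl\{ II \in \mathbb{N} \Bigm| \forall c \in \mathit{cycles}:\ 0 \ge \sum_{e \in c} \bigl(\mathit{delay}(e) - II \cdot \mathit{distance}(e)\bigr) \Bigr\}$$

Or:

$$\mathrm{RecMII} = \max_{c \in \mathit{cycles}} \left\lceil \frac{\sum_{e \in c} \mathit{delay}(e)}{\sum_{e \in c} \mathit{distance}(e)} \right\rceil$$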
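The two bounds can be evaluated for the example loop a[i+6] = 3 * a[i] - 1. The sketch below is illustrative: the edge list is one reading of the (delay, distance) labels in the dependence graph (forward edges (1,0), backward edges (0,1), and (1,6) from st to ld), and the resource counts are assumptions, not values from the slides. RecMII is found as the smallest II for which no dependence cycle has a positive sum of delay - II * distance.

```python
from math import ceil


def res_mii(used, available):
    """Resource bound: the most heavily used resource determines ResMII."""
    return max(ceil(used[r] / available[r]) for r in used)


def rec_mii(nodes, edges, max_ii=64):
    """Smallest II such that every cycle has sum(delay - II*distance) <= 0."""
    neg_inf = float("-inf")
    for ii in range(1, max_ii + 1):
        dist = {(u, v): neg_inf for u in nodes for v in nodes}
        for u, v, delay, distance in edges:
            dist[u, v] = max(dist[u, v], delay - ii * distance)
        # Floyd-Warshall on longest paths; a positive diagonal entry
        # means some dependence cycle is violated for this II.
        for k in nodes:
            for u in nodes:
                for v in nodes:
                    dist[u, v] = max(dist[u, v], dist[u, k] + dist[k, v])
        if all(dist[v, v] <= 0 for v in nodes):
            return ii
    raise ValueError("no feasible II up to max_ii")


nodes = ["ld", "mul", "sub", "st"]
edges = [  # (from, to, delay, distance), assumed reading of the figure
    ("ld", "mul", 1, 0), ("mul", "ld", 0, 1),
    ("mul", "sub", 1, 0), ("sub", "mul", 0, 1),
    ("sub", "st", 1, 0), ("st", "sub", 0, 1),
    ("st", "ld", 1, 6),
]
used = {"mem": 2, "mul": 1, "alu": 1}        # ld and st both use memory
available = {"mem": 2, "mul": 1, "alu": 1}   # hypothetical machine

mii = max(res_mii(used, available), rec_mii(nodes, edges))
print("MII =", mii)
```

With a single memory port (`available["mem"] = 1`, again an assumption) ResMII rises to 2 and becomes the binding constraint, while the recurrence bound stays at 1.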

Page 27: Processor Architectures and Program Mapping

The Role of the Compiler

9 steps required to translate an HLL program

• Front-end compilation

• Determine dependencies

• Graph partitioning: make multiple threads (or tasks)

• Bind partitions to compute nodes

• Bind operands to locations

• Bind operations to time slots: Scheduling

• Bind operations to functional units

• Bind transports to buses

• Execute operations and perform transports

Page 28: Processor Architectures and Program Mapping

Division of responsibilities between hardware and compiler

[Figure: the translation steps from application to execution: Frontend, Determine Dependencies, Binding of Operands, Scheduling, Binding of Operations, Binding of Transports, Execute. For each architecture class a boundary separates the steps that are the responsibility of the compiler from those that are the responsibility of the hardware. Architecture classes marked along the boundary: Superscalar, Dataflow, Multi-threaded, Indep. Arch, VLIW, TTA.]

Page 29: Processor Architectures and Program Mapping

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
• Hands-on

Page 30: Processor Architectures and Program Mapping

Hands-on (not this year)

• Map JPEG to a TTA processor
  – see web page: http://www.ics.ele.tue.nl/~heco/courses/pam

• Install TTA tools (compiler and simulator)

• Go through all listed steps

• Perform DSE: design space exploration

• Add SFU

• 1 or 2 page report in 2 weeks

Page 31: Processor Architectures and Program Mapping

Hands-on

• Let’s look at DSE: Design Space Exploration

• We will use the Imagine processor

• http://cva.stanford.edu/projects/imagine/

Page 32: Processor Architectures and Program Mapping

Mapping applications to processors
MOVE framework

[Figure: Move framework. Architecture parameters and user interaction feed an optimizer, which drives a parametric compiler and a hardware generator and receives feedback from both. The parametric compiler produces parallel object code and the hardware generator produces a chip; together these form a TTA based system. The explored designs are shown as a Pareto curve (solution space) of execution time against cost.]

Page 33: Processor Architectures and Program Mapping

Code generation trajectory for TTAs

[Figure: Application (C) → Compiler frontend → Sequential code → Compiler backend → Parallel code. The sequential code is run in a sequential simulation (with input/output), which yields profiling data; the parallel code is run in a parallel simulation (with input/output). An architecture description is a further input to the trajectory.]

• Frontend: GCC or SUIF (adapted)

Page 34: Processor Architectures and Program Mapping

Exploration: TTA resource reduction

Page 35: Processor Architectures and Program Mapping

Exploration: TTA connectivity reduction

[Figure: execution time against number of connections removed. Annotations on the curve: reducing bus delay; FU stage constrains cycle time; critical connections disappear.]

Page 36: Processor Architectures and Program Mapping

Can we do better?

How?

• Transformations
• SFUs: Special Function Units
• Multiple Processors

[Figure: execution time against cost]

Page 37: Processor Architectures and Program Mapping

Transforming the specification

[Figure: two expression graphs of three + operations each, before and after the transformation]

Based on associativity of the + operation: a + (b + c) = (a + b) + c
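The benefit of re-association is a shorter dependence chain. The sketch below represents a sum as nested tuples, an illustrative encoding that is not part of the slides, and compares the depth of a chain of additions with that of a balanced tree over the same four operands.

```python
def depth(expr):
    """Number of dependent + operations on the longest path; leaves are names."""
    if isinstance(expr, str):
        return 0
    _, left, right = expr
    return 1 + max(depth(left), depth(right))


def balance(operands):
    """Rebuild a sum as a balanced tree; valid because + is associative."""
    if len(operands) == 1:
        return operands[0]
    mid = len(operands) // 2
    return ("+", balance(operands[:mid]), balance(operands[mid:]))


chain = ("+", "a", ("+", "b", ("+", "c", "d")))   # a + (b + (c + d))
tree = balance(["a", "b", "c", "d"])              # (a + b) + (c + d)
print(depth(chain), depth(tree))
```

Both forms use three additions, but the balanced form needs two dependent steps instead of three, provided two adders are available in the first step. Note that floating-point addition is not exactly associative, so compilers usually apply this transformation to floating-point code only when explicitly allowed.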

Page 38: Processor Architectures and Program Mapping

Transforming the specification

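d = a * b;
e = a + d;
f = 2 * b + d;
r = f - e;
x = z + y;

is transformed into:

r = 2*b - a;
x = z + y;

[Figure: DDG of the transformed code. b is shifted left by 1 (<<) and a is subtracted from the result to give r; z + y gives x.]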
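The transformation holds because d cancels: r = f - e = (2*b + d) - (a + d) = 2*b - a, and 2*b can be computed as b << 1. The following sketch checks this on random integers; it assumes integer arithmetic without overflow, and the function names are illustrative.

```python
import random


def original(a, b, z, y):
    d = a * b
    e = a + d
    f = 2 * b + d
    r = f - e
    x = z + y
    return r, x


def transformed(a, b, z, y):
    r = (b << 1) - a   # d cancels out; 2*b becomes a shift by one
    x = z + y
    return r, x


random.seed(0)
for _ in range(1000):
    args = [random.randint(-1000, 1000) for _ in range(4)]
    assert original(*args) == transformed(*args)
print("original and transformed code agree")
```

The transformed code needs one shift, one subtraction and one addition instead of two multiplications, three additions and a subtraction, and its critical path is two operations long instead of three.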

Page 39: Processor Architectures and Program Mapping

Changing the architecture
adding SFUs: special function units

[Figure: a graph of three + operations replaced by a single 4-input adder]

4-input adder: why is this faster?

Page 40: Processor Architectures and Program Mapping

Changing the architecture
adding SFUs: special function units

In the extreme case put everything into one unit!

Spatial mapping: no control flow

However: no flexibility / programmability!

Page 41: Processor Architectures and Program Mapping

SFUs: fine grain patterns

• Why use fine grain SFUs:
  – Code size reduction
  – Register file #ports reduction
  – Could be cheaper and/or faster
  – Transport reduction
  – Power reduction (avoid charging non-local wires)
  – Supports whole application domain!

Which patterns need support?
• Detection of recurring operation patterns needed

Page 42: Processor Architectures and Program Mapping

SFUs: covering results

Page 43: Processor Architectures and Program Mapping

Exploration: resulting architecture

[Figure: resulting TTA with 9 buses, 4 RFs, 4 Addercmp FUs, 2 Multiplier FUs, 2 Diffadd FUs, a stream input and a stream output]

Architecture for image processing
• Note the reduced connectivity

Page 44: Processor Architectures and Program Mapping

Conclusions

• Billions of embedded processing systems
  – how to design these systems quickly, cheaply, correctly, and with low power?
  – what will their processing platform look like?
• VLIWs are very powerful and flexible
  – can be easily tuned to application domain
• TTAs are even more flexible, scalable, and lower power

Page 45: Processor Architectures and Program Mapping

Conclusions

• Compilation for ILP architectures is maturing and entering the commercial arena.
• However
  – Great discrepancy between available and exploitable parallelism
• Advanced code scheduling techniques needed to exploit ILP

Page 46: Processor Architectures and Program Mapping

Bottom line: