Processor Architectures and Program Mapping

TU/e 5kk10

Henk Corporaal

Jef van Meerbergen

Bart Mesman

Exploiting ILP, part 2: code generation

04/19/23 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

Overview

• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
• Hands-on


Compiler basics

• Overview
  – Compiler trajectory / structure / passes
  – Control Flow Graph (CFG)
  – Mapping and Scheduling
  – Basic block list scheduling
  – Extended scheduling scope
  – Loop scheduling


Compiler basics: trajectory

[Figure: compilation trajectory. Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program. Other labels in the figure: Error messages, Library code.]


Compiler basics: structure / passes

[Figure: compiler passes, leading from Source code via Intermediate code and Sequential code to Object code.]

• Lexical analyzer: token generation
• Parsing: check syntax, check semantic, parse tree generation
• Code optimization: data flow analysis, local optimizations, global optimizations
• Code generation: code selection, peephole optimizations
• Register allocation: making interference graph, graph coloring, spill code insertion, caller / callee save and restore code
• Scheduling and allocation: exploiting ILP


Compiler basics: structure
Simple compilation example

Source:
    position := initial + rate * 60

After the lexical analyzer:
    id := id + id * 60

After the syntax analyzer (parse tree):
    :=(id, +(id, *(id, 60)))

After the intermediate code generator:
    temp1 := inttoreal(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3

After the code optimizer:
    temp1 := id3 * 60.0
    id1 := id2 + temp1

After the code generator:
    movf id3, r2
    mulf #60, r2, r2
    movf id2, r1
    addf r2, r1
    movf r1, id1


Compiler basics: Control flow graph (CFG)

C input code:
    if (a > b) { r = a % b; } else { r = b % a; }

CFG:
    1:  sub t1, a, b
        bgz t1, 2, 3
    2:  rem r, a, b
        goto 4
    3:  rem r, b, a
        goto 4
    4:  ...

A program is a collection of functions; each function is a collection of basic blocks; each basic block contains a set of instructions; each instruction consists of several transports.


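Mapping / Scheduling: placing operations in space and time

    d = a * b;
    e = a + d;
    f = 2 * b + d;
    r = f - e;
    x = z + y;

[Figure: Data Dependence Graph (DDG) of this code, with inputs a, b, 2, z, y; two multiplications; additions producing e, f, and x; and a subtraction producing r.]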


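How to map these operations?

Architecture constraints:
• One Function Unit
• All operations single cycle latency

[Figure: the DDG next to a six-cycle schedule on the single function unit, one operation per cycle.]

    cycle 1: *
    cycle 2: *
    cycle 3: +
    cycle 4: +
    cycle 5: -
    cycle 6: +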


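How to map these operations?

Architecture constraints:
• One Add-sub and one Mul unit
• All operations single cycle latency

[Figure: the DDG next to a schedule table with one column per unit (Mul, Add-sub) and rows for cycles 1 to 6. The two multiplications go to the Mul unit; the three additions and the subtraction go to the Add-sub unit.]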


There are many mapping solutions

[Figure: Pareto curve (solution space). Scatter plot of mapping solutions, each marked x, with Cost on one axis and Execution time on the other.]


Basic Block Scheduling

• Make a dependence graph

• Determine minimal length

• Determine ASAP, ALAP, and slack of each operation

• Place each operation in first cycle with sufficient resources

Note:
– Scheduling order is sequential
– Priority is determined by the heuristic used, e.g. slack
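The ASAP, ALAP, and slack computation listed above can be sketched in Python on the small DDG from the "Mapping / Scheduling" slide. This is only an illustration under stated assumptions: every operation has a latency of one cycle, cycles are numbered from 1, `t` is a hypothetical name for the intermediate product `2 * b`, and the function name `asap_alap_slack` is not part of the course material.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Predecessors of each operation in:
#   d = a*b; e = a+d; f = 2*b+d; r = f-e; x = z+y
# "t" is a hypothetical name for the intermediate product 2*b.
PREDS = {"d": [], "t": [], "e": ["d"], "f": ["t", "d"], "r": ["f", "e"], "x": []}

def asap_alap_slack(preds, latency=1):
    """Return (asap, alap, slack, length) for an acyclic dependence graph."""
    order = list(TopologicalSorter(preds).static_order())  # predecessors first

    # ASAP: earliest cycle in which all operands are available.
    asap = {}
    for v in order:
        asap[v] = max((asap[u] + latency for u in preds[v]), default=1)
    length = max(asap.values())  # minimal schedule length in cycles

    # Successor lists, needed for the backward (ALAP) pass.
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for u in ps:
            succs[u].append(v)

    # ALAP: latest cycle that does not stretch the minimal length.
    alap = {}
    for v in reversed(order):
        alap[v] = min((alap[s] - latency for s in succs[v]), default=length)

    slack = {v: alap[v] - asap[v] for v in preds}
    return asap, alap, slack, length

asap, alap, slack, length = asap_alap_slack(PREDS)
print(length)  # 3
print(slack)   # {'d': 0, 't': 0, 'e': 0, 'f': 0, 'r': 0, 'x': 2}
```

Only `x` has slack here: it depends on nothing on the critical path, so a scheduler that prioritizes by slack would place it last.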


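Basic Block Scheduling

[Figure: example dependence graph with LD, MUL, NEG, ADD, and SUB operations on inputs A, B, C, producing X, y, and z. Each operation is annotated <ASAP cycle, ALAP cycle>; the annotations shown are <1,1>, <2,2>, <3,3>, <4,4>, <1,3>, <1,4>, <2,3>, and <2,4>. The slack of an operation is the difference between its ALAP and ASAP cycles.]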


Cycle based list scheduling

    proc Schedule(DDG = (V,E))
    beginproc
      ready  = { v | ¬∃ (u,v) ∈ E }
      ready’ = ready
      sched  = ∅
      current_cycle = 0
      while sched ≠ V do
        for each v ∈ ready’ do
          if ¬ResourceConfl(v, current_cycle, sched) then
            cycle(v) = current_cycle
            sched = sched ∪ {v}
          endif
        endfor
        current_cycle = current_cycle + 1
        ready  = { v | v ∉ sched ∧ ∀ (u,v) ∈ E, u ∈ sched }
        ready’ = { v | v ∈ ready ∧ ∀ (u,v) ∈ E, cycle(u) + delay(u,v) ≤ current_cycle }
      endwhile
    endproc
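A runnable Python sketch of cycle based list scheduling, applied to the DDG of the "Mapping / Scheduling" slide, may help to see the procedure at work. It is an illustration rather than the course's implementation: the resource conflict check is reduced to counting free units per cycle, the priority heuristic is simply the order of the `ops` dictionary, and the names `list_schedule`, `t` (for `2 * b`), `fu`, `mul`, and `addsub` are assumptions.

```python
def list_schedule(ops, edges, units):
    """Cycle based list scheduling.

    ops:   operation -> unit type (dict order is the priority order)
    edges: (u, v) -> delay of the dependence u -> v
    units: unit type -> number of units (each at least 1)
    Returns operation -> issue cycle, with cycles numbered from 0.
    """
    preds = {v: [] for v in ops}
    for (u, v), delay in edges.items():
        preds[v].append((u, delay))

    cycle = {}   # scheduled operations and their issue cycle
    current = 0
    while len(cycle) < len(ops):
        free = dict(units)  # units still free in the current cycle
        for v in ops:
            if v in cycle:
                continue
            # Ready: every predecessor is scheduled and its delay has elapsed.
            if not all(u in cycle and cycle[u] + d <= current for u, d in preds[v]):
                continue
            if free[ops[v]] > 0:  # no resource conflict
                free[ops[v]] -= 1
                cycle[v] = current
        current += 1
    return cycle

# d = a*b; t = 2*b; e = a+d; f = t+d; r = f-e; x = z+y, all with delay 1.
EDGES = {("d", "e"): 1, ("d", "f"): 1, ("t", "f"): 1, ("e", "r"): 1, ("f", "r"): 1}
OPS_ONE_FU = {op: "fu" for op in ("d", "t", "e", "f", "r", "x")}
OPS_TWO = {"d": "mul", "t": "mul", "e": "addsub",
           "f": "addsub", "r": "addsub", "x": "addsub"}

one_fu = list_schedule(OPS_ONE_FU, EDGES, {"fu": 1})
two_units = list_schedule(OPS_TWO, EDGES, {"mul": 1, "addsub": 1})
print(max(one_fu.values()) + 1)     # 6 cycles
print(max(two_units.values()) + 1)  # 4 cycles
```

With a single function unit the six operations need six cycles. With one Mul and one Add-sub unit this particular priority order yields four cycles, because `x = z + y` can share the first cycle with the first multiplication.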


Extended basic block scheduling: Code Motion

    A:  a) add r4, r4, 4
        b) beq . . .
    B:  c) add r1, r1, r2
    C:  d) sub r1, r1, r2
    D:  e) st r1, 8(r4)

• Downward code motions?
  – a → B, a → C, a → D, c → D, d → D
• Upward code motions?
  – c → A, d → A, e → B, e → C, e → A


Extended Scheduling scope

Code:
    A;
    If cond Then B Else C;
    D;
    If cond Then E Else F;
    G;

CFG (Control Flow Graph): A branches to B and C, which join in D; D branches to E and F, which join in G.


Scheduling scopes

[Figure: four scheduling scopes drawn on a CFG: Trace, Superblock, Decision tree, Hyperblock/region.]


Code movement (upwards) within regions

[Figure: an add operation moved upwards from a source block to a destination block through intermediate blocks (I). Legend: code movement; copy needed; check for off-liveness.]


Extended basic block scheduling: Code Motion

• A dominates B ⇒ A is always executed before B
  – Consequently: A does not dominate B ⇒ code motion from B to A requires code duplication
• B post-dominates A ⇒ B is always executed after A
  – Consequently: B does not post-dominate A ⇒ code motion from B to A is speculative

[Figure: example CFG with blocks A, B, C, D, E, F.]

Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
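Questions of this kind can be answered mechanically. The Python sketch below computes dominator and post-dominator sets with the standard iterative set-intersection method. Because the edges of the example CFG on this slide are not recoverable from the text, the sketch uses the CFG of the "Extended Scheduling scope" slide instead (A; if cond then B else C; D; if cond then E else F; G). The function names are assumptions made for this illustration.

```python
def dominators(succ, entry):
    """Map each node to the set of nodes that dominate it.

    succ: node -> list of successors. Assumes every node other than
    the entry is reachable and therefore has at least one predecessor.
    """
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for u, vs in succ.items():
        for v in vs:
            pred[v].add(u)

    dom = {n: set(nodes) for n in nodes}  # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = {n} | set.intersection(*(dom[p] for p in pred[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def reverse(succ):
    """Reverse all edges; post-dominators are dominators of the reversed CFG."""
    rev = {n: [] for n in succ}
    for u, vs in succ.items():
        for v in vs:
            rev[v].append(u)
    return rev

CFG = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}

dom = dominators(CFG, "A")
pdom = dominators(reverse(CFG), "G")

print("B" in dom["D"])   # False: moving code from D up to B needs a copy in C
print("D" in pdom["B"])  # True: D always executes after B
print("B" in pdom["A"])  # False: moving code from B up to A is speculative
```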


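Scheduling: Loops

Loop Optimizations:

[Figure: a loop CFG with blocks A, B, C, D, shown in its original form, after loop peeling, and after loop unrolling; the transformed versions contain copies C’ and C’’ of block C.]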


Scheduling: Loops

Problems with unrolling:
• Exploits only parallelism within sets of n iterations
• Iteration start-up latency
• Code expansion

[Figure: resource utilization over time for three approaches: basic block scheduling, basic block scheduling and unrolling, and software pipelining.]


Software pipelining

• Software pipelining a loop is:
  – Scheduling the loop such that iterations start before preceding iterations have finished
  Or:
  – Moving operations across the backedge

Example: y = a.x

Each iteration consists of the operations LD, ML, ST.

• Basic block scheduling: 3 cycles/iteration
• Unrolling (three iterations in five cycles): 5/3 cycles/iteration

      LD
      LD ML
      LD ML ST
      ML ST
      ST

• Software pipelining: 1 cycle/iteration
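The overlapped schedule of this example can be generated with a few lines of Python. The sketch assumes that each stage takes one cycle, that the machine can issue one LD, one ML, and one ST in the same cycle, and that a new iteration starts every `ii` cycles; the function name `pipeline_rows` is hypothetical.

```python
def pipeline_rows(stages, iterations, ii=1):
    """Return one string per cycle listing the stages issued in that cycle.

    Stage number s of iteration i is issued in cycle i * ii + s.
    """
    total = (iterations - 1) * ii + len(stages)
    rows = []
    for t in range(total):
        row = []
        for s, name in enumerate(stages):
            i, rem = divmod(t - s, ii)  # which iteration, if any, is at stage s
            if rem == 0 and 0 <= i < iterations:
                row.append(name)
        rows.append(" ".join(row))
    return rows

rows = pipeline_rows(["LD", "ML", "ST"], iterations=3)
print(rows)       # ['LD', 'LD ML', 'LD ML ST', 'ML ST', 'ST']
print(len(rows))  # 5 cycles for 3 iterations, i.e. 5/3 cycles/iteration
```

For n iterations the schedule takes n + 2 cycles, so the cost per iteration approaches 1 cycle as n grows; the repeating row `LD ML ST` is the kernel.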


Software pipelining (cont’d)

Basic techniques:
• Modulo scheduling (Rau, Lam)
  – list scheduling with modulo resource constraints
• Kernel recognition techniques
  – unroll the loop
  – schedule the iterations
  – identify a repeating pattern
  – Examples:
    • Perfect pipelining (Aiken and Nicolau)
    • URPR (Su, Ding and Xia)
    • Petri net pipelining (Allan)
• Enhanced pipeline scheduling (Ebcioğlu)
  – fill first cycle of iteration
  – copy this instruction over the backedge


Software pipelining: Modulo scheduling

Example: Modulo scheduling a loop

(a) Example loop
    for (i = 0; i < n; i++)
        a[i+6] = 3 * a[i] - 1;

(b) Code without loop control
    ld  r1,(r2)
    mul r3,r1,3
    sub r4,r3,1
    st  r4,(r5)

(c) Software pipeline
    [Figure: four overlapping copies of the code in (b), each started later than the previous one, divided into Prologue, Kernel, and Epilogue.]

• Prologue fills the SW pipeline with iterations
• Epilogue drains the SW pipeline


Software pipelining: determine II, Initiation Interval

    for (i=0;.....)
        A[i+6] = 3*A[i]-1

Cyclic data dependences

[Figure: dependence graph of ld r1,(r2); mul r3,r1,3; sub r4,r3,1; st r4,(r5). Edges are labeled (delay, distance). Each adjacent pair of operations is connected by edges labeled (1,0) and (0,1), and one further edge is labeled (1,6).]

    cycle(v) ≥ cycle(u) + delay(u,v) - II.distance(u,v)


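Modulo scheduling constraints

MII, the minimum initiation interval, is bounded by cyclic dependences and resources:

$$\mathit{MII} = \max\{\mathit{ResMII}, \mathit{RecMII}\}$$

Resources:

$$\mathit{ResMII} = \max_{r \in \mathit{resources}} \left\lceil \frac{\mathit{used}(r)}{\mathit{available}(r)} \right\rceil$$

Cycles:

$$\mathit{cycle}(v) \ge \mathit{cycle}(v) + \sum_{e \in c} \bigl( \mathit{delay}(e) - \mathit{II} \cdot \mathit{distance}(e) \bigr)$$

Therefore:

$$\mathit{RecMII} = \min\Bigl\{ \mathit{II} \in \mathbb{N} \;\Big|\; \forall c \in \mathit{cycles}:\; 0 \ge \sum_{e \in c} \bigl( \mathit{delay}(e) - \mathit{II} \cdot \mathit{distance}(e) \bigr) \Bigr\}$$

Or:

$$\mathit{RecMII} = \max_{c \in \mathit{cycles}} \left\lceil \frac{\sum_{e \in c} \mathit{delay}(e)}{\sum_{e \in c} \mathit{distance}(e)} \right\rceil$$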
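The two bounds can be computed with a short Python sketch. Instead of enumerating all dependence cycles, it searches for the smallest II ≥ 1 for which the graph with edge weights delay − II·distance has no positive cycle, which is the "Therefore" formulation above. Several things are assumptions made for illustration only: the edge list is one reading of the (delay, distance) labels in the dependence graph of the loop A[i+6] = 3*A[i]-1, the resource counts (one load/store unit, one multiplier, one ALU) describe a hypothetical machine, and the function names are invented.

```python
from math import ceil

def res_mii(used, available):
    """Resource bound: max over resources of ceil(used / available)."""
    return max(ceil(used[r] / available[r]) for r in used)

def has_positive_cycle(nodes, edges, ii):
    """Longest-path relaxation with weights delay - ii * distance."""
    dist = {n: 0 for n in nodes}
    for _ in range(len(nodes)):
        changed = False
        for u, v, delay, distance in edges:
            w = delay - ii * distance
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return False  # converged, so no positive cycle exists
    return True           # still improving after |V| passes

def rec_mii(nodes, edges):
    """Recurrence bound: smallest II >= 1 that satisfies every cycle."""
    upper = max(1, sum(delay for _, _, delay, _ in edges))
    for ii in range(1, upper + 1):
        if not has_positive_cycle(nodes, edges, ii):
            return ii
    raise ValueError("a dependence cycle has total distance 0")

# (from, to, delay, distance); assumed reading of the figure labels.
NODES = ["ld", "mul", "sub", "st"]
EDGES = [("ld", "mul", 1, 0), ("mul", "ld", 0, 1),
         ("mul", "sub", 1, 0), ("sub", "mul", 0, 1),
         ("sub", "st", 1, 0), ("st", "sub", 0, 1),
         ("st", "ld", 1, 6)]

# Hypothetical machine: ld and st share one load/store unit.
USED = {"lsu": 2, "mul": 1, "alu": 1}
AVAILABLE = {"lsu": 1, "mul": 1, "alu": 1}

mii = max(res_mii(USED, AVAILABLE), rec_mii(NODES, EDGES))
print(rec_mii(NODES, EDGES))      # 1
print(res_mii(USED, AVAILABLE))   # 2
print(mii)                        # 2
```

On this assumed machine the loop is resource bound: the long recurrence through memory has delay 4 and distance 6, so it does not constrain II beyond 1.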


The Role of the Compiler

9 steps required to translate an HLL program

• Front-end compilation

• Determine dependencies

• Graph partitioning: make multiple threads (or tasks)

• Bind partitions to compute nodes

• Bind operands to locations

• Bind operations to time slots: Scheduling

• Bind operations to functional units

• Bind transports to buses

• Execute operations and perform transports


Division of responsibilities between hardware and compiler

[Figure: the steps Frontend, Determine Dependencies, Binding of Operands, Scheduling, Binding of Operations, Binding of Transports, and Execute, applied to an Application. For each architecture class (Superscalar, Dataflow, Multi-threaded, Indep. Arch, VLIW, TTA) the figure marks where the responsibility of the compiler ends and the responsibility of the hardware begins.]



Hands-on (not this year)

• Map JPEG to a TTA processor
  – see web page: http://www.ics.ele.tue.nl/~heco/courses/pam

• Install TTA tools (compiler and simulator)

• Go through all listed steps

• Perform DSE: design space exploration

• Add SFU

• 1 or 2 page report in 2 weeks


Hands-on

• Let’s look at DSE: Design Space Exploration

• We will use the Imagine processor

• http://cva.stanford.edu/projects/imagine/


Mapping applications to processors: MOVE framework

[Figure: the MOVE framework. Architecture parameters and user interaction drive an Optimizer, which steers a Parametric compiler and a Hardware generator and receives feedback from both. The Parametric compiler produces parallel object code and the Hardware generator produces a chip; together they form a TTA based system. The explored solutions are plotted as a Pareto curve (solution space) of exec. time against cost.]


Code generation trajectory for TTAs

[Figure: Application (C) → Compiler frontend → Sequential code → Compiler backend → Parallel code. The sequential code feeds a Sequential simulation and the parallel code a Parallel simulation, each with Input/Output. The figure also shows an Architecture description and Profiling data.]

• Frontend: GCC or SUIF (adapted)


Exploration: TTA resource reduction

Exploration: TTA connectivity reduction

[Figure: execution time plotted against the number of connections removed. Annotations: reducing bus delay; FU stage constrains cycle time; critical connections disappear.]


Can we do better?

How?
• Transformations
• SFUs: Special Function Units
• Multiple Processors

[Figure: Execution time plotted against Cost.]


Transforming the specification

Based on associativity of the + operation:

    a + (b + c) = (a + b) + c

[Figure: two groups of three + operations each, before and after the transformation.]
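Why re-association helps a scheduler can be seen by comparing the depth of a chain of additions with that of a balanced tree. The Python sketch below pairs operands level by level; the function name `balance` and the four-operand example are assumptions for illustration. Note that the rewrite is exact for integer arithmetic, while for floating point it can change rounding.

```python
def balance(operands):
    """Pair operands level by level; return (expression, depth in additions)."""
    level = [(name, 0) for name in operands]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            (left, dl), (right, dr) = level[i], level[i + 1]
            nxt.append((f"({left} + {right})", max(dl, dr) + 1))
        if len(level) % 2:          # odd operand is carried to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

expr, depth = balance(["a", "b", "c", "d"])
print(expr)   # ((a + b) + (c + d))
print(depth)  # 2, versus 3 for the chain ((a + b) + c) + d
```

With single-cycle adders and at least two of them, the balanced form finishes in two cycles instead of three, because (a + b) and (c + d) are independent.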


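Transforming the specification

    d = a * b;
    e = a + d;
    f = 2 * b + d;
    r = f - e;
    x = z + y;

becomes

    r = 2*b - a;
    x = z + y;

[Figure: DDG of the transformed code: b shifted left by 1 (<<), followed by a subtraction of a producing r; and z + y producing x.]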
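The simplification on this slide follows from substituting the definitions of f and e into r. A worked derivation is given below; it assumes that d, e, and f are not needed elsewhere, so that the multiplication a * b becomes dead code once d cancels.

```latex
\begin{align*}
r &= f - e \\
  &= (2b + d) - (a + d) \\
  &= 2b - a \\
  &= (b \ll 1) - a
\end{align*}
```

The last step replaces the multiplication by 2 with a left shift by one position, which is what the << node in the transformed graph represents.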


Changing the architecture
adding SFUs: special function units

[Figure: a group of + operations replaced by a single 4-input adder.]

4-input adder: why is this faster?


Changing the architecture
adding SFUs: special function units

In the extreme case put everything into one unit!

Spatial mapping: no control flow

However: no flexibility / programmability!


SFUs: fine grain patterns

• Why use fine grain SFUs?
  – Code size reduction
  – Register file #ports reduction
  – Could be cheaper and/or faster
  – Transport reduction
  – Power reduction (avoid charging non-local wires)
  – Supports whole application domain!

Which patterns need support?
• Detection of recurring operation patterns needed


SFUs: covering results


Exploration: resulting architecture

Architecture for image processing
• Note the reduced connectivity

[Figure: resulting TTA with 9 buses, 4 RFs, 4 Addercmp FUs, 2 Multiplier FUs, 2 Diffadd FUs, a stream input, and a stream output.]


Conclusions

• Billions of embedded processing systems
  – how to design these systems quickly, cheaply, correctly, with low power, ...?
  – what will their processing platform look like?
• VLIWs are very powerful and flexible
  – can be easily tuned to application domain
• TTAs even more flexible, scalable, and lower power


Conclusions

• Compilation for ILP architectures is maturing and is entering the commercial area.
• However
  – Great discrepancy between available and exploitable parallelism
• Advanced code scheduling techniques needed to exploit ILP


Bottom line:
