A New Dataflow Compiler IR for Accelerating Control-Intensive Code in Spatial Hardware
Ali Mustafa Zaidi, David Greaves
{alimustafa.zaidi, david.greaves}@cl.cam.ac.uk
University of Cambridge Computer Laboratory
The Dark Silicon Problem
[Figure: at a fixed 80W power budget, achievable clock frequency vs. fraction of silicon that can be active: 2.1GHz @ 90nm (18%), 5.2GHz @ 45nm (7%), 7.3GHz @ 32nm (3%). Amdahl's Law + Utilization Wall = Dark Silicon. Projected speedup from 45nm → 8nm (32x resources): CPU 3.5x, GPU 2.4x (Conservative scaling); CPU 7.9x, GPU 2.7x (ITRS scaling).]
Esmaeilzadeh et al, "Dark Silicon and the End of Multicore Scaling". IEEE Micro 2012.
Need for both high 'Sequential' Performance, AND Very High Energy Efficiency.
[Figure: Power Dissipation vs. Relative Performance, comparing the Conventional (superscalar) design curve against a hypothetical Spatial curve: conventional power grows steeply with performance.]
Can we achieve Superscalar Performance, without Superscalar Overheads?
Superscalar Processors:
● Only option for sequential performance!
● Power scales exponentially with performance
Custom Hardware (e.g. the custom video decoder on an ARM A5 SoC):
● 10–1000x efficiency!
● Not for sequential code!
Solution: Spatial Architectures?
● Custom Hardware, FPGAs, CGRAs, MPPAs, etc.
● Advantages
– Scalable, Decentralized architectures, with short, p2p wiring.
– High Computational Density
– 10–1000x Energy efficiency & Performance.
● Issues
– Poor Programmability: often requiring low-level hardware knowledge
– Limited Amenability: poor performance on sequential, irregular, or complex control-flow code.
● Examples
– Conservation Cores: Performance ≈ in-order MIPS 24KE core
– Phoenix CASH Hardware: Performance 30% less than 4-way OoO Core.
● Key Reasons for High Performance of Complex, OoO Superscalars:
– Aggressive Control-flow Speculation
– Dynamic, out-of-order execution scheduling
● Custom hardware has very limited speculation
– Single flow of control
– If-conversion & hyperblock formation for forward branches.
– No acceleration of backwards branches!
Control-Data Flow Graph
[Figure: CDFG of a 'for' loop: Start → i = 0; test i < 100; body loads A[i], tests A[i] > 0, calls foo() on True or bar() on False; i++; back-edge to the test; exit to End on False.]
McFarlin et al., “Discerning the dominant out-of-order performance advantage: is it speculation or dynamism?”, ASPLOS ’13
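The if-conversion bullet above can be sketched concretely. This is an illustrative example (not from the paper's toolchain; foo/bar are placeholder workloads): the branch condition becomes an ordinary data value (a predicate), both arms execute speculatively, and a select picks one result, removing the control-flow transfer.

```python
def foo(x):  # placeholder workload for the 'True' arm
    return x * 2

def bar(x):  # placeholder workload for the 'False' arm
    return x + 1

def branching(a, i):
    # Original control-flow form: only one arm executes.
    if a[i] > 0:
        return foo(a[i])
    else:
        return bar(a[i])

def if_converted(a, i):
    # If-converted form: the predicate is computed as data, both arms
    # run speculatively, and the predicate selects between the results.
    p = a[i] > 0
    t = foo(a[i])   # executed speculatively
    f = bar(a[i])   # executed speculatively
    return t if p else f

assert branching([3], 0) == if_converted([3], 0)
assert branching([-3], 0) == if_converted([-3], 0)
```

The cost of this transformation, visible in the energy results later, is that both arms always switch, whether or not their result is kept.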
Our Solution
Instead of
CDFG IR + Compile-time Execution Scheduling
We Employ
VSFG IR + Dataflow Execution Model
Solution: Spatial Architectures!
Value State Flow Graph
● Hierarchical Dataflow Graph – Instead of [Basic Blocks + Control Flow], we have [Nested Subgraphs + Dataflow]
– Functions → nested subgraphs
– Loops → tail-recursive functions.
● Dataflow execution of operations – Multiple Subgraphs may execute concurrently in Dataflow order (unlike basic blocks).
– Exposes Multiple Flows of Control!
[Figure: VSFG of the same 'for' loop example: predicated subgraphs for foo()/bar() selected by predicate P from A[i] > 0, explicit STATE_IN/STATE_OUT edges, an inPred input, and a tail-recursive subgraph invoking the next iteration of the 'for' loop while i < 100.]
VSFG: Value-State Flow Graph
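The "Loops → tail-recursive functions" bullet above can be sketched as follows. This is an assumed illustration, not the paper's implementation: the loop-carried values become explicit arguments, i.e. the dataflow edges feeding the next iteration's subgraph, and each tail call corresponds to one invocation of the loop subgraph.

```python
def loop_iterative(a):
    # Conventional CFG form of the running example's loop.
    acc = 0
    for i in range(100):
        acc += a[i]
    return acc

def loop_tail_recursive(a, i=0, acc=0):
    # VSFG-style form: loop-carried state (i, acc) flows as arguments.
    if not (i < 100):                                  # loop-exit predicate
        return acc
    return loop_tail_recursive(a, i + 1, acc + a[i])   # tail call = next iteration

data = list(range(100))
assert loop_iterative(data) == loop_tail_recursive(data)
```

Flattening (inlining) a few levels of the tail call is what the slides later describe as loop unrolling/pipelining.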
Value State Flow Graph
● Infinite DAG – Loops represented as Tail Recursion
– Branches represented via if-conversion
– Enables Aggressive Speculation!
● No single 'Flow of Control' – Instead, control implemented via 'Boolean Predicate Expressions'.
– Logic minimization can simplify expressions, facilitating Control Dependence Analysis!
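A minimal sketch of the 'Boolean Predicate Expressions' idea above (assumed example, using a brute-force truth-table check in place of a real logic minimizer): an operation reached on both arms of a branch has predicate (p and q) or (p and not q), which minimizes to p, revealing that it is control-independent of q.

```python
from itertools import product

def equivalent(f, g, nvars):
    # Exhaustively compare two predicate expressions over all assignments.
    return all(f(*vs) == g(*vs) for vs in product([False, True], repeat=nvars))

# Predicate accumulated along two converging CFG paths...
original  = lambda p, q: (p and q) or (p and not q)
# ...and its logic-minimized form: the operation only depends on p.
minimized = lambda p, q: p

assert equivalent(original, minimized, 2)
```

In hardware this matters directly: a simpler predicate expression is a smaller trigger circuit and a weaker (hence less serializing) control dependence.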
Value State Flow Graph
● Hierarchical Dataflow Graph – Subgraphs may be 'predicated', or executed speculatively (via 'if-conversion').
– 'Flattening' loop tail-call subgraphs → loop unrolling/pipelining.
– Multiple loops in a loop-nest may be unrolled independently to expose ILP
High Level Synthesis Case Study
Toolflow: Any High-Level Language → LLVM → VSFG → Low-Level IR → Bluespec SystemVerilog → ASIC / FPGA

LLVM IR:
%1 = mul i32 %x, %y
%2 = srem i32 %1, %z
%3 = icmp slt i32 %2, %1

Equivalent Bluespec:
FIFOF#(int) x <- mkFIFOF1;
FIFOF#(int) y <- mkFIFOF1;
FIFOF#(int) z <- mkFIFOF1;
FIFOF#(int) srem_1 <- mkFIFOF1;
FIFOF#(int) icmp_1 <- mkFIFOF1;
FIFOF#(int) icmp_2 <- mkFIFOF1;
FIFOF#(int) out_3 <- mkFIFOF1;

rule mul_inst;
  let val1 = x.first; x.deq;
  let val2 = y.first; y.deq;
  let rslt = val1 * val2;
  srem_1.enq(rslt);
  icmp_1.enq(rslt);
endrule

rule srem_inst;
  let val1 = srem_1.first; srem_1.deq;
  let val2 = z.first; z.deq;
  let rslt = val1 % val2;
  icmp_2.enq(rslt);
endrule
...
Hardware Oriented Dataflow IR

LLVM IR:
%1 = mul i32 %x, %y      ; <i32>
%2 = srem i32 %1, %z     ; <i32>
%3 = icmp slt i32 %2, %1 ; <i1>

[Figure: the corresponding Value-State Flow Graph: mul(x, y) → %1, srem(%1, z) → %2, icmp(%2, %1) → %3.]

Mapping to the Petri Net based Low-Level Dataflow IR:
Registers → Petri Net Places
Instructions → Petri Net Transitions
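The Places/Transitions mapping can be illustrated with a toy interpreter (an assumed sketch, mirroring the three-instruction LLVM example; note Python's % only matches LLVM srem for non-negative operands): registers are places holding value tokens, instructions are transitions that fire as soon as all their input places hold tokens, regardless of program order.

```python
def run_dataflow(x, y, z):
    places = {'x': x, 'y': y, 'z': z}   # initial tokens

    # Transitions: (input places, output place, operation).
    # Listed out of program order on purpose: firing order is driven
    # purely by token availability, as in the dataflow execution model.
    transitions = [
        (('r1', 'z'),  'r2', lambda a, b: a % b),        # srem
        (('r2', 'r1'), 'r3', lambda a, b: int(a < b)),   # icmp slt
        (('x', 'y'),   'r1', lambda a, b: a * b),        # mul
    ]

    fired = set()
    while len(fired) < len(transitions):
        for t, (ins, out, op) in enumerate(transitions):
            if t not in fired and all(p in places for p in ins):
                # Fire: produce the output token (inputs are kept, since
                # each value feeds all its consumers in this sketch).
                places[out] = op(*(places[p] for p in ins))
                fired.add(t)
    return places['r3']

assert run_dataflow(6, 7, 5) == 1   # mul: 42; srem: 42 % 5 = 2; icmp: 2 < 42
```

The Bluespec rules on the neighbouring slide realise exactly this firing discipline in hardware: a rule is enabled when its input FIFOs are non-empty.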
● LegUp: LLVM 2.9, -O2, no LTO, no LTI, no Op Chaining, Statically Scheduled CFG
● Our Toolchain: LLVM 2.6, -O2, no LTO, no LTI, no Op Chaining, Dynamically Scheduled VSFG
● Performance and Energy Evaluation by comparing with:
– LegUp HLS Tool & Altera Nios II/f Processor, implemented on Altera Stratix IV GX FPGA.
– Nehalem Core i7 (Sniper interval simulator from Intel).
– In all cases, memory access latency assumed == 1 Cycle.
Performance (Cycle Counts)
● Normalised to LegUp
● Compared to Nios II/f & Intel Nehalem Core i7 (SniperSim)
[Chart: cycle counts normalized to LegUp for epic*, adpcm, dfadd, dfdiv, dfmul, dfsin, mips, bimpa, and GEOMEAN; series: LegUp (CDFG), VSFG_0, VSFG_1, VSFG_3. Units: Matrix Transpose, adpcm, dfsin in thousands of cycles; Neural Net Simulator in millions.]
Frequency & Delay
[Charts: operating frequency (MHz, vs. Nios II/f @ 250MHz) and delay normalized to LegUp, for epic*, adpcm, dfadd, dfdiv, dfmul, dfsin, mips, bimpa, and GEOMEAN; series: LegUp (CFG), VSFG_0, VSFG_1, VSFG_3.]
Power & Speculation Overheads
[Charts: useful vs. misspeculated switching activity (bits) for epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips; series: LegUp, VSFG_0_Eff. Power estimation assuming 250MHz operating frequency.]
Power & Speculation Overheads
[Charts: useful vs. misspeculated switching activity (bits) for epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips; series: LegUp, VSFG_0_Eff, VSFG_1_Eff, VSFG_3_Eff. Power estimation assuming 250MHz operating frequency.]
Normalized Energy
[Chart (log scale): energy normalized to LegUp for epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips, and GEOMEAN; series: LegUp, VSFG_0, VSFG_1, VSFG_3, Nios.]
Sources of Energy Inefficiency
● Energy Cost Comparison:
– vs Nios II/f: 0.25x (GEOMEAN)
– vs LegUp: 3–4x (GEOMEAN)
● Overheads of Speculation
– Balance between speculation & predication must be found for efficiency & performance
● Part of power dissipation proportional to Area
– Clock Gating for predicated regions to reduce dynamic power (consider asynchronous circuits)
– Power gating for predicated regions to reduce static power?
– Selective loop unrolling.
Current Performance Limitations
● 35% better performance than statically scheduled CFG, without any optimizations:
– Improvements due to dynamic scheduling, MFC & CDA
– Unrolling helps, but speedup saturates quickly.
● Further Improvements possible:
– Balance between predication & speculation, to improve speedup without unrolling (thus reducing area and energy costs)
– State-edge is on critical path – limits both unrolling & MFC.
● Last remnant of the 'sequential' nature of the program.
● Frequency Scaling limited by Memory Interconnect
– Partition memory & pipeline memory access tree
Implicit Parallelism & State-edge Partitioning
Assertion: Implicit (deterministic) parallel programming models are essentially means of partitioning the state-edge.
[Figure: spectrum of techniques, from increasing programmer/compiler effort to increasing runtime effort: Alias Analysis, Speculative Loads, OpenMP, OpenCL, Sieve C++, SpMT/TLS, Dynamic OoO LSQ.]
Thank you for listening!
Questions &/or
Comments?
Overcoming Control-Flow with the VSFG
[Figure: side-by-side comparison of the Control-Data Flow Graph and the Value-State Flow Graph for the same 'for' loop example.]
Performance (Cycle Counts)
● Cycle counts normalized to LegUp results
● VSFG implemented with all loops unrolled 0, 1, and 3 times
● Full Speculation: all subgraphs (except loops) triggered without predicates
[Chart: Cycle Counts with Full Speculation, for epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips**, small_bimpa; series: LegUp (CFG), VSFG_0, VSFG_1, VSFG_3.]
Performance (Cycle Counts)
● Predication: only one block will execute
● Speculation: both blocks execute, but only one result is chosen
[Chart: Cycle Counts with Full Speculation, as on the previous slide.]
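The predication-vs-speculation trade-off above can be sketched with made-up unit latencies (a hypothetical model, not measured data): under predication the chosen arm only starts once the predicate resolves, while under speculation both arms start immediately and the predicate merely gates the final select.

```python
# Assumed, illustrative latencies (in cycles) for one branch region.
PRED_LATENCY   = 3   # time to resolve the branch predicate
ARM_LATENCY    = 5   # time to execute either arm of the branch
SELECT_LATENCY = 1   # time for the final select/multiplexer

def predicated_latency():
    # The selected arm waits for the predicate before starting.
    return PRED_LATENCY + ARM_LATENCY + SELECT_LATENCY

def speculative_latency():
    # Both arms run in parallel with predicate evaluation.
    return max(PRED_LATENCY, ARM_LATENCY) + SELECT_LATENCY

assert speculative_latency() < predicated_latency()
```

Speculation buys latency at the cost of switching both arms every time, which is exactly the misspeculated-activity overhead shown in the power charts.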
Performance (Cycle Counts)
[Chart: Cycle Counts with Predicated Subgraphs, for epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips**, small_bimpa; series: LegUp (CFG), VSFG_0, VSFG_1, VSFG_3.]
Performance (Cycle Counts)

Raw cycle counts per platform:

Benchmark   | Core i7  | Nios II/f | LegUp     | VSFG_0    | VSFG_1   | VSFG_3
small_bimpa | 39664956 | 373347552 | 142386696 | 114361494 | 98179648 | 97430648
dfsin       | 104953   | 1420558   | 105773    | 72007     | 71896    | 71896
epic        | 200174   | 3399634   | 1078444   | 1062436   | 528218   | 265170
adpcm       | 42662    | 119794    | 71349     | 57860     | 51580    | 51186
dfadd       | 15994    | 16441     | 2391      | 1999      | 1590     | 1574
dfdiv       | 15120    | 36487     | 3029      | 3235      | 2825     | 2639
dfmul       | 14072    | 7074      | 941       | 916       | 671      | 625
mips**      | 29998    | 31082     | 13414     | 14489     | 13438    | 12953
Understanding OoO Performance
● Control flow is the primary constraint on ILP
– Wall (1991): Conventional processors limited to ILP of 4-8!
● Single Flow of Control
● Branch prediction (95%+ accuracy)
– Lam & Wilson (1993), Mak & Mycroft (2009): 10x ILP possible, with:
● Control Dependence Analysis (CDA)
● Multiple Flows of Control (MFC)
● Custom hardware has very limited speculation
– Single flow of control
– If-conversion & hyperblock formation for forward branches.
– No acceleration of backwards branches!
[Figure: the Control-Data Flow Graph of the 'for' loop example from earlier.]