A New Dataflow Compiler IR for Accelerating Control-Intensive Code in Spatial Hardware
Ali Mustafa Zaidi, David Greaves
{alimustafa.zaidi, david.greaves}@cl.cam.ac.uk
University of Cambridge Computer Laboratory
The Dark Silicon Problem
[Figure: at a fixed 80W power budget, achievable clock frequency vs. fraction of silicon that can be active: 2.1GHz @ 90nm (18%), 5.2GHz @ 45nm (7%), 7.3GHz @ 32nm (3%). Amdahl's Law + Utilization Wall = Dark Silicon. Projected speedup from 45nm → 8nm (32x resources): CPU 3.5x, GPU 2.4x (Conservative scaling); CPU 7.9x, GPU 2.7x (ITRS scaling).]
Esmaeilzadeh et al, "Dark Silicon and the End of Multicore Scaling". IEEE Micro 2012.
Need for both high 'Sequential' Performance, AND Very High Energy Efficiency.
[Figure: Power Dissipation vs. Relative Performance, comparing the Conventional (superscalar) design curve against a hypothetical Spatial curve: conventional power grows steeply with performance.]
Can we achieve Superscalar Performance, without Superscalar Overheads?
Superscalar Processors:
● Only option for sequential performance!
● Power scales exponentially with performance
Custom Hardware (e.g. the custom video decoder on an ARM A5 SoC):
● 10–1000x efficiency!
● Not for sequential code!
Solution: Spatial Architectures?
● Custom Hardware, FPGAs, CGRAs, MPPAs, etc.
● Advantages
– Scalable, Decentralized architectures, with short, p2p wiring.
– High Computational Density
– 10–1000x Energy efficiency & Performance.
● Issues
– Poor Programmability: often requiring low-level hardware knowledge
– Limited Amenability: poor performance on sequential, irregular, or complex control-flow code.
● Examples
– Conservation Cores: Performance ≈ in-order MIPS 24KE core
– Phoenix CASH Hardware: Performance 30% less than 4-way OoO Core.
● Key Reasons for High Performance of Complex, OoO Superscalars:
– Aggressive Control-flow Speculation
– Dynamic, out-of-order execution scheduling
● Custom hardware has very limited speculation
– Single flow of control
– If-conversion & hyperblock formation for forward branches.
– No acceleration of backwards branches!
Control-Data Flow Graph
[Figure: CDFG of a 'for' loop: Start → i = 0; test i < 100; body loads A[i], tests A[i] > 0, calls foo() on True or bar() on False; i++; back-edge to the test; exit to End on False.]
McFarlin et al., “Discerning the dominant out-of-order performance advantage: is it speculation or dynamism?”, ASPLOS ’13
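The if-conversion bullet above can be sketched concretely. This is an illustrative example (not from the paper's toolchain; foo/bar are placeholder workloads): the branch condition becomes an ordinary data value (a predicate), both arms execute speculatively, and a select picks one result, removing the control-flow transfer.

```python
def foo(x):  # placeholder workload for the 'True' arm
    return x * 2

def bar(x):  # placeholder workload for the 'False' arm
    return x + 1

def branching(a, i):
    # Original control-flow form: only one arm executes.
    if a[i] > 0:
        return foo(a[i])
    else:
        return bar(a[i])

def if_converted(a, i):
    # If-converted form: the predicate is computed as data, both arms
    # run speculatively, and the predicate selects between the results.
    p = a[i] > 0
    t = foo(a[i])   # executed speculatively
    f = bar(a[i])   # executed speculatively
    return t if p else f

assert branching([3], 0) == if_converted([3], 0)
assert branching([-3], 0) == if_converted([-3], 0)
```

The cost of this transformation, visible in the energy results later, is that both arms always switch, whether or not their result is kept.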
Our Solution
Instead of
CDFG IR + Compile-time Execution Scheduling
We Employ
VSFG IR + Dataflow Execution Model
Solution: Spatial Architectures!
Value State Flow Graph
● Hierarchical Dataflow Graph – Instead of [Basic Blocks + Control Flow], we have [Nested Subgraphs + Dataflow]
– Functions → nested subgraphs
– Loops → tail-recursive functions.
● Dataflow execution of operations – Multiple Subgraphs may execute concurrently in Dataflow order (unlike basic blocks).
– Exposes Multiple Flows of Control!
[Figure: VSFG of the same 'for' loop example: predicated subgraphs for foo()/bar() selected by predicate P from A[i] > 0, explicit STATE_IN/STATE_OUT edges, an inPred input, and a tail-recursive subgraph invoking the next iteration of the 'for' loop while i < 100.]
VSFG: Value-State Flow Graph
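The "Loops → tail-recursive functions" bullet above can be sketched as follows. This is an assumed illustration, not the paper's implementation: the loop-carried values become explicit arguments, i.e. the dataflow edges feeding the next iteration's subgraph, and each tail call corresponds to one invocation of the loop subgraph.

```python
def loop_iterative(a):
    # Conventional CFG form of the running example's loop.
    acc = 0
    for i in range(100):
        acc += a[i]
    return acc

def loop_tail_recursive(a, i=0, acc=0):
    # VSFG-style form: loop-carried state (i, acc) flows as arguments.
    if not (i < 100):                                  # loop-exit predicate
        return acc
    return loop_tail_recursive(a, i + 1, acc + a[i])   # tail call = next iteration

data = list(range(100))
assert loop_iterative(data) == loop_tail_recursive(data)
```

Flattening (inlining) a few levels of the tail call is what the slides later describe as loop unrolling/pipelining.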
Value State Flow Graph
● Infinite DAG – Loops represented as Tail Recursion
– Branches represented via if-conversion
– Enables Aggressive Speculation!
● No single 'Flow of Control' – Instead, control implemented via 'Boolean Predicate Expressions'.
– Logic minimization can simplify expressions, facilitating Control Dependence Analysis!
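A minimal sketch of the 'Boolean Predicate Expressions' idea above (assumed example, using a brute-force truth-table check in place of a real logic minimizer): an operation reached on both arms of a branch has predicate (p and q) or (p and not q), which minimizes to p, revealing that it is control-independent of q.

```python
from itertools import product

def equivalent(f, g, nvars):
    # Exhaustively compare two predicate expressions over all assignments.
    return all(f(*vs) == g(*vs) for vs in product([False, True], repeat=nvars))

# Predicate accumulated along two converging CFG paths...
original  = lambda p, q: (p and q) or (p and not q)
# ...and its logic-minimized form: the operation only depends on p.
minimized = lambda p, q: p

assert equivalent(original, minimized, 2)
```

In hardware this matters directly: a simpler predicate expression is a smaller trigger circuit and a weaker (hence less serializing) control dependence.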
Value State Flow Graph
● Hierarchical Dataflow Graph – Subgraphs may be 'predicated', or executed speculatively (via 'if-conversion').
– 'Flattening' loop tail-call subgraphs → loop unrolling/pipelining.
– Multiple loops in a loop-nest may be unrolled independently to expose ILP
High Level Synthesis Case Study
Toolflow: Any High-Level Language → LLVM → VSFG → Low-Level IR → Bluespec SystemVerilog → ASIC / FPGA

LLVM IR:
%1 = mul i32 %x, %y
%2 = srem i32 %1, %z
%3 = icmp slt i32 %2, %1

Equivalent Bluespec:
FIFOF#(int) x <- mkFIFOF1;
FIFOF#(int) y <- mkFIFOF1;
FIFOF#(int) z <- mkFIFOF1;
FIFOF#(int) srem_1 <- mkFIFOF1;
FIFOF#(int) icmp_1 <- mkFIFOF1;
FIFOF#(int) icmp_2 <- mkFIFOF1;
FIFOF#(int) out_3 <- mkFIFOF1;

rule mul_inst;
  let val1 = x.first; x.deq;
  let val2 = y.first; y.deq;
  let rslt = val1 * val2;
  srem_1.enq(rslt);
  icmp_1.enq(rslt);
endrule

rule srem_inst;
  let val1 = srem_1.first; srem_1.deq;
  let val2 = z.first; z.deq;
  let rslt = val1 % val2;
  icmp_2.enq(rslt);
endrule
...
Hardware Oriented Dataflow IR

LLVM IR:
%1 = mul i32 %x, %y      ; <i32>
%2 = srem i32 %1, %z     ; <i32>
%3 = icmp slt i32 %2, %1 ; <i1>

[Figure: the corresponding Value-State Flow Graph: mul(x, y) → %1, srem(%1, z) → %2, icmp(%2, %1) → %3.]

Mapping to the Petri Net based Low-Level Dataflow IR:
Registers → Petri Net Places
Instructions → Petri Net Transitions
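The Places/Transitions mapping can be illustrated with a toy interpreter (an assumed sketch, mirroring the three-instruction LLVM example; note Python's % only matches LLVM srem for non-negative operands): registers are places holding value tokens, instructions are transitions that fire as soon as all their input places hold tokens, regardless of program order.

```python
def run_dataflow(x, y, z):
    places = {'x': x, 'y': y, 'z': z}   # initial tokens

    # Transitions: (input places, output place, operation).
    # Listed out of program order on purpose: firing order is driven
    # purely by token availability, as in the dataflow execution model.
    transitions = [
        (('r1', 'z'),  'r2', lambda a, b: a % b),        # srem
        (('r2', 'r1'), 'r3', lambda a, b: int(a < b)),   # icmp slt
        (('x', 'y'),   'r1', lambda a, b: a * b),        # mul
    ]

    fired = set()
    while len(fired) < len(transitions):
        for t, (ins, out, op) in enumerate(transitions):
            if t not in fired and all(p in places for p in ins):
                # Fire: produce the output token (inputs are kept, since
                # each value feeds all its consumers in this sketch).
                places[out] = op(*(places[p] for p in ins))
                fired.add(t)
    return places['r3']

assert run_dataflow(6, 7, 5) == 1   # mul: 42; srem: 42 % 5 = 2; icmp: 2 < 42
```

The Bluespec rules on the neighbouring slide realise exactly this firing discipline in hardware: a rule is enabled when its input FIFOs are non-empty.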
● LegUp: LLVM 2.9, -O2, no LTO, no LTI, no Op Chaining, Statically Scheduled CFG
● Our Toolchain: LLVM 2.6, -O2, no LTO, no LTI, no Op Chaining, Dynamically Scheduled VSFG
● Performance and Energy Evaluation by comparing with:
– LegUp HLS Tool & Altera Nios II/f Processor, implemented on Altera Stratix IV GX FPGA.
– Nehalem Core i7 (Sniper interval simulator from Intel).
– In all cases, memory access latency assumed == 1 Cycle.
Performance (Cycle Counts)
● Normalised to LegUp
● Compared to Nios II/f & Intel Nehalem Core i7 (SniperSim)
[Chart: cycle counts normalized to LegUp for epic*, adpcm, dfadd, dfdiv, dfmul, dfsin, mips, bimpa, and GEOMEAN; series: LegUp (CDFG), VSFG_0, VSFG_1, VSFG_3. Units: Matrix Transpose, adpcm, dfsin in thousands of cycles; Neural Net Simulator in millions.]
Frequency & Delay
[Charts: operating frequency (MHz, vs. Nios II/f @ 250MHz) and delay normalized to LegUp, for epic*, adpcm, dfadd, dfdiv, dfmul, dfsin, mips, bimpa, and GEOMEAN; series: LegUp (CFG), VSFG_0, VSFG_1, VSFG_3.]
Power & Speculation Overheads
[Charts: useful vs. misspeculated switching activity (bits) for epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips; series: LegUp, VSFG_0_Eff. Power estimation assuming 250MHz operating frequency.]
Power & Speculation Overheads
[Charts: useful vs. misspeculated switching activity (bits) for epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips; series: LegUp, VSFG_0_Eff, VSFG_1_Eff, VSFG_3_Eff. Power estimation assuming 250MHz operating frequency.]
Normalized Energy
[Chart (log scale): energy normalized to LegUp for epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips, and GEOMEAN; series: LegUp, VSFG_0, VSFG_1, VSFG_3, Nios.]
Sources of Energy Inefficiency
● Energy Cost Comparison:
– vs Nios II/f: 0.25x (GEOMEAN)
– vs LegUp: 3–4x (GEOMEAN)
● Overheads of Speculation
– Balance between speculation & predication must be found for efficiency & performance
● Part of power dissipation proportional to Area
– Clock Gating for predicated regions to reduce dynamic power (consider asynchronous circuits)
– Power gating for predicated regions to reduce static power?
– Selective loop unrolling.
Current Performance Limitations
● 35% better performance than statically scheduled CFG, without any optimizations:
– Improvements due to dynamic scheduling, MFC & CDA
– Unrolling helps, but speedup saturates quickly.
● Further Improvements possible:
– Balance between predication & speculation, to improve speedup without unrolling (thus reducing area and energy costs)
– State-edge is on critical path – limits both unrolling & MFC.
● Last remnant of the 'sequential' nature of the program.
● Frequency Scaling limited by Memory Interconnect
– Partition memory & pipeline memory access tree
Implicit Parallelism & State-edge Partitioning
Assertion: Implicit (deterministic) parallel programming models are essentially means of partitioning the state-edge.
[Figure: spectrum of techniques, from increasing programmer/compiler effort to increasing runtime effort: Alias Analysis, Speculative Loads, OpenMP, OpenCL, Sieve C++, SpMT/TLS, Dynamic OoO LSQ.]
Thank you for listening!
Questions &/or
Comments?
Overcoming Control-Flow with the VSFG
[Figure: side-by-side comparison of the Control-Data Flow Graph and the Value-State Flow Graph for the same 'for' loop example.]
Performance (Cycle Counts)
● Cycle counts normalized to LegUp results
● VSFG implemented with all loops unrolled 0, 1, and 3 times
● Full Speculation: all subgraphs (except loops) triggered without predicates
[Chart: Cycle Counts with Full Speculation, for epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips**, small_bimpa; series: LegUp (CFG), VSFG_0, VSFG_1, VSFG_3.]
Performance (Cycle Counts)
● Predication: only one block will execute
● Speculation: both blocks execute, but only one result is chosen
[Chart: Cycle Counts with Full Speculation, as on the previous slide.]
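The predication-vs-speculation trade-off above can be sketched with made-up unit latencies (a hypothetical model, not measured data): under predication the chosen arm only starts once the predicate resolves, while under speculation both arms start immediately and the predicate merely gates the final select.

```python
# Assumed, illustrative latencies (in cycles) for one branch region.
PRED_LATENCY   = 3   # time to resolve the branch predicate
ARM_LATENCY    = 5   # time to execute either arm of the branch
SELECT_LATENCY = 1   # time for the final select/multiplexer

def predicated_latency():
    # The selected arm waits for the predicate before starting.
    return PRED_LATENCY + ARM_LATENCY + SELECT_LATENCY

def speculative_latency():
    # Both arms run in parallel with predicate evaluation.
    return max(PRED_LATENCY, ARM_LATENCY) + SELECT_LATENCY

assert speculative_latency() < predicated_latency()
```

Speculation buys latency at the cost of switching both arms every time, which is exactly the misspeculated-activity overhead shown in the power charts.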
Performance (Cycle Counts)
[Chart: Cycle Counts with Predicated Subgraphs, for epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips**, small_bimpa; series: LegUp (CFG), VSFG_0, VSFG_1, VSFG_3.]
Performance (Cycle Counts)

Raw cycle counts per platform:

Benchmark   | Core i7  | Nios II/f | LegUp     | VSFG_0    | VSFG_1   | VSFG_3
small_bimpa | 39664956 | 373347552 | 142386696 | 114361494 | 98179648 | 97430648
dfsin       | 104953   | 1420558   | 105773    | 72007     | 71896    | 71896
epic        | 200174   | 3399634   | 1078444   | 1062436   | 528218   | 265170
adpcm       | 42662    | 119794    | 71349     | 57860     | 51580    | 51186
dfadd       | 15994    | 16441     | 2391      | 1999      | 1590     | 1574
dfdiv       | 15120    | 36487     | 3029      | 3235      | 2825     | 2639
dfmul       | 14072    | 7074      | 941       | 916       | 671      | 625
mips**      | 29998    | 31082     | 13414     | 14489     | 13438    | 12953
Understanding OoO Performance
● Control flow is the primary constraint on ILP
– Wall (1991): Conventional processors limited to ILP of 4-8!
● Single Flow of Control
● Branch prediction (95%+ accuracy)
– Lam & Wilson (1993), Mak & Mycroft (2009): 10x ILP possible, with:
● Control Dependence Analysis (CDA)
● Multiple Flows of Control (MFC)
● Custom hardware has very limited speculation
– Single flow of control
– If-conversion & hyperblock formation for forward branches.
– No acceleration of backwards branches!
[Figure: the Control-Data Flow Graph of the 'for' loop example from earlier.]