Post on 21-Dec-2015
Synthesis of Custom Processors based on Extensible Platforms
Fei Sun+, Srivaths Ravi++, Anand Raghunathan++ and Niraj K. Jha+
+: Dept. of Electrical Engineering, Princeton University
++: NEC Laboratories America, Inc.
Outline
SoC design constraints
Background
  Previous work in ASIP design
  Xtensa platform
  Manual custom instruction generation procedure
Automatic custom instruction generation flow
Experimental results
Conclusions
SoC Design Constraints
Time to market
Cost
Performance
Power
Cost-performance trade-off
Flexibility
……
Comparison of Different Approaches
                  ASIC  ASIP  GPP
Time to market     --    +    ++
Cost               ++    +    --
Performance        ++    +    --
Power              ++    +    --
Cost-performance   ++    +    --
Flexibility        --    +    ++
++ Very good    + Good    -- Very bad
Flexibility vs. Energy Efficiency
[Figure: flexibility plotted against energy efficiency for four implementation styles. Labels: General Embedded Processor (AMD-K6E), Domain Specific Processor (DSP), ASIP (Xtensa), ASIC. Energy-efficiency ranges shown: 0.1 - 1 MIPS/mW, 1 - 10 MIPS/mW, 50 - 100 MIPS/mW, 500 - 1000 MOPS/mW.]
Previous Work in ASIP Design
ASIP architectures and overall design methodologies
  [Huang, 1994], [Adams, 1996], [Fisher, 1999], [Kucukcakar, 1999]
Application-specific instruction set selection
  [Choi, 1999], [Gschwind, 1999], [Arnold, 1999]
Low power ASIP design
  [Kalambur, 1997], [Dougherty, 1999], [Ishihara, 2000], [Sami, 2001]
Commercial offerings
  Xtensa, ARCtangent, Jazz, SP-5flex, Carmel
Xtensa Architecture
[Figure: Xtensa processor block diagram. Blocks: Processor Controls; TRACE Port; JTAG Tap Control; On Chip Debug; Exception Support; Interrupt Control; Timers 1 to n; Data Address Watch 0 to n; Instruction Address Watch 0 to n; Instruction Memory or Cache & Tags; Branch Logic & Instruction Fetch; Align and Decode; Window Register File; ALU & Address Generation; MAC 16; Special Function Register Access; Coprocessor Register File; Coprocessor Execution Units; Designer Defined Instruction Execution Unit; Data Memory or Cache & Tags; Memory Protection Unit; Write Buffer; Processor Interface. Signal labels: Instruction, Data, Instruction Address, Data Address. Legend: Base ISA Feature, Configurable Function, Optional Function, Configurable & Optional Function, Extensible.]
Source: www.tensilica.com
Xtensa Processor Design Flow
[Figure: Xtensa processor design flow. Boxes: Processor Configuration Inputs; Designer-Defined Instruction Descriptions; Configuration File; Configured GNU C/C++ Compiler; Configured GNU Assembler/Disassembler; Configured Instruction Set Simulator/Emulator; Configured Processor HDL; Area, Power and Timing Estimation; Logic Synthesis (Synopsys or Ambit); Block Place/Route (Avant! or Cadence); Timing Verification; Hardware Profile; Application Source Code; Sample Application Data; Application Specific Compile, Assemble, Link; Application Simulation with ISS and/or Emulator; Software Debugging/Profiling; Optimized Software; Optimized Hardware. Legend: Generator Output, Internal Database, Design data, Use of Generated Data.]
Source: www.tensilica.com
Manual Custom Instruction Generation Procedure
Identify potential new instructions
Describe custom instructions
Insert custom instructions
Verify functional correctness
Profile, read source code
Understand source code
Rewrite source code
Slow and error-prone
Contributions of Our Work
Automatic custom instruction selection
  Application program to extensible processors with custom instructions
Features
  Efficient design space search
  Use accurate information from instruction set simulator and synthesis
  Bridge the gap between automatically synthesized and manually designed architectures
Automatic Custom Instruction Generation Flow
[Figure: flow chart, steps 1 - 20. Input: Application program (C). Boxes: Profile C program (Profiler, xt-gprof); Generate program dependence graphs (Aristotle analysis system); Rank control blocks; Generate templates; Select templates (step 5); Generate individual custom instr (steps 6 - 13); Select custom instr combination; Generate custom instr combination; Synthesize custom instr combination; Clock period/area constraints met? (N: Next instr combination; Y: continue); Build processor; Profile C with instr combination; Synthesize processor (step 20).]
Example Illustration of Template Generation
c = a & 0xff;       // node 1
d = b & 0xff + c;   // node 2
e = d << 24;        // node 3
g = f & 0xff00;     // node 4
[Figure: dependence graph of nodes 1 - 4 over variables a, b, c, d, e, f, g, with node weights 0.03, 0.03, 0.03 and 0.06. Successive slides build up the template set:]
Basic templates: {1}, {2}, {3}, {4}
Dependent templates: {1,2}, {2,3}, {1,2,3}
Independent templates: {1,4}, {2,4}, {3,4}, {1,2,4}, {2,3,4}, {1,2,3,4}
Key Observations for Pruning
The higher the weight of a template, the higher the potential for improvement (Amdahl’s law)
The scope for optimization is determined by computation: the number of cycles needed to execute the template
The scope for optimization is also limited by the read/write ports: additional cycles are needed for extra reading/writing of input/output variables
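As a worked illustration of the first observation (not taken from the slides), Amdahl's law bounds the overall speedup obtained when a template of weight w is accelerated by a factor s. The symbols w and s are introduced here for illustration only, and the sample weight 0.06 is simply one of the node weights shown in the template example.

```latex
\[
  S_{\text{overall}} = \frac{1}{(1 - w) + \dfrac{w}{s}},
  \qquad
  \lim_{s \to \infty} S_{\text{overall}} = \frac{1}{1 - w}.
\]
% Example: w = 0.06 gives an upper bound of 1 / 0.94 \approx 1.064,
% no matter how fast the custom instruction executes.
```

A template with a small weight therefore offers little gain however well it is implemented, which is why weight is a natural first ranking signal.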
Pruning Algorithm
Ranking criterion:
OriginalTime: Fraction of the total execution time of the original program spent in the template (weight)
In, Out: Number of inputs and outputs of the template, respectively
α, β: Number of inputs/outputs encoded in the instruction
γ: Number of cycles needed for executing the template
Higher priority means greater potential for speed up
Template Generation with Pruning
[Figure: animation over four slides. Labels: Ranked pool of seed templates, Highest priority, Template set, Threshold: 0.1. Priority values shown: 16.35, 12.73, 10.51, 7.92, 5.36, 4.05, 2.13 and 1.18. The template with priority 12.73 is marked as highest priority on the first slides; the values 5.36, 1.18 and 16.35 appear next to it, and 1.18 no longer appears on the final slide.]
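The slides do not spell out how the threshold is applied. The following is a minimal sketch under one ASSUMED rule: a newly generated template is kept only if its priority is at least the threshold ratio times the highest priority in the current pool. The `Template` type and both function names are hypothetical and are not part of the authors' tool.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for a template; only the priority matters here. */
typedef struct {
    int id;          /* illustrative identifier */
    double priority; /* value of the ranking criterion */
} Template;

/* Return the highest priority in pool[0..n-1]; n must be positive. */
double highest_priority(const Template *pool, size_t n)
{
    assert(n > 0);
    double best = pool[0].priority;
    for (size_t i = 1; i < n; i++)
        if (pool[i].priority > best)
            best = pool[i].priority;
    return best;
}

/*
 * ASSUMED rule: keep a candidate only if its priority is at least
 * threshold_ratio times the highest priority currently in the pool.
 * Survivors are compacted to the front of cand[]; the new count is returned.
 */
size_t prune_candidates(Template *cand, size_t n_cand,
                        const Template *pool, size_t n_pool,
                        double threshold_ratio)
{
    double cutoff = threshold_ratio * highest_priority(pool, n_pool);
    size_t kept = 0;
    for (size_t i = 0; i < n_cand; i++)
        if (cand[i].priority >= cutoff)
            cand[kept++] = cand[i];
    return kept;
}
```

Under this assumed rule, a pool whose highest priority is 12.73 and a threshold of 0.1 give a cutoff of 1.273, so candidates with priorities 5.36 and 16.35 are kept and one with priority 1.18 is discarded.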
No. of Templates vs. Threshold Ratio
Automatic Custom Instruction Generation Flow (Contd.)
[Figure: flow chart expanding "Generate individual custom instr" (steps 6 - 13), entered from "Select templates" (step 5). Boxes: Extract templates; Generate custom instr; Generate RTL Verilog; Synthesize Verilog; Clock period constraint met? (N: Increase number of cycles or increase clock period); Insert custom instr; Profile C with custom instr; All templates built? (N: Next template). Tools shown: TIE compiler, Synopsys Design Compiler.]
Custom Instruction Insertion
Custom instructions must be inserted at appropriate places without affecting the program’s functional correctness
If a custom instruction needs extra inputs (outputs), appropriate variables must be selected to write to (read from) user-defined registers
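Example Illustration of Custom Instruction Insertion
(a) Original code
t = s >> 24;            // 1
r = t & 0xff;           // 2
a[5] = t + d;           // 3
m = b[0];               // 4
y = x + m;              // 5
(b) Code after insertion
m = b[0];               // 4
y = CustomInstr(s,m);   // 1,2,5
t = RUR(0);             // 1,2,5
a[5] = t + d;           // 3
[Figure: dependence graphs for (a), with nodes 1 - 5, and for (b), with nodes 3, 4 and the merged node 1,2,5.]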
Example Illustration of Custom Instruction Insertion (Contd.)
(a)
....
offset = t + 1;
for (i=0; i<100; i++){
    j = ....
    result = offset + i * j;
}
....
(b)
....
offset = t + 1;
WUR(offset,0);
for (i=0; i<100; i++){
    j = ....
    result = CustomInstr(i,j);
}
....
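The loop example can be mimicked on a host machine with a small software model of the user-defined register. This is only a sketch: `WUR`, `RUR` and `CustomInstr` below are plain C stand-ins written for illustration, not the Xtensa intrinsics, and `NUM_USER_REGS` and the two `loop_*` functions are hypothetical. Because the slide elides how j is computed, the j values are supplied by the caller.

```c
#include <assert.h>

#define NUM_USER_REGS 4  /* illustrative size, not an Xtensa parameter */

static int user_reg[NUM_USER_REGS];  /* software stand-in for user-defined registers */

/* Model of WUR: write value into user register idx (argument order follows the slide). */
void WUR(int value, int idx)
{
    assert(idx >= 0 && idx < NUM_USER_REGS);
    user_reg[idx] = value;
}

/* Model of RUR: read user register idx. */
int RUR(int idx)
{
    assert(idx >= 0 && idx < NUM_USER_REGS);
    return user_reg[idx];
}

/* Model of the custom instruction: its third operand comes from user register 0. */
int CustomInstr(int i, int j)
{
    return RUR(0) + i * j;
}

/* Version (a): plain C, with offset as an ordinary third operand. */
int loop_original(int t, const int *j_vals)
{
    int offset = t + 1, result = 0;
    for (int i = 0; i < 100; i++) {
        int j = j_vals[i];
        result = offset + i * j;
    }
    return result;
}

/* Version (b): offset is written once before the loop and read implicitly inside it. */
int loop_custom(int t, const int *j_vals)
{
    int offset = t + 1, result = 0;
    WUR(offset, 0);
    for (int i = 0; i < 100; i++) {
        int j = j_vals[i];
        result = CustomInstr(i, j);
    }
    return result;
}
```

The loop-invariant operand is moved into the user register once, outside the loop, so the instruction itself only needs two register-file reads per iteration.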
Custom Instruction Combination Selection --- Problem Statement
Given a set of non-overlapping custom instructions, with each instruction having several versions, find a version for each instruction such that performance is maximized while area is under a certain threshold
Custom Instruction Combination Selection --- Flow Chart
[Figure: recursive flow chart between Start and Stop. Decision boxes: All instrs analyzed?; Performance upper bound is among the best?; Area meets constraint?; All versions considered?; Performance is among the best?. Action boxes: Add current version of current instr to solution; Next version; Next instruction (recursive call); Update best solutions.]
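The selection problem and the recursive flow chart can be approximated by a small branch-and-bound search. The sketch below is an illustration under stated assumptions, not the authors' implementation: all type and function names are hypothetical, performance is modeled as an additive non-negative gain per version, every instruction has at least one version, and only a single best solution is tracked rather than a set of best solutions.

```c
#include <assert.h>

#define MAX_INSTRS   8  /* illustrative limits */
#define MAX_VERSIONS 4

typedef struct {
    double area;  /* extra area of this version */
    double gain;  /* estimated performance gain, e.g. cycles saved (>= 0) */
} Version;

typedef struct {
    int n_versions;          /* must be at least 1 */
    Version v[MAX_VERSIONS];
} Instr;

typedef struct {
    const Instr *instrs;
    int n_instrs;
    double max_area;
    double best_gain;            /* best feasible gain so far, -1 if none */
    int best_choice[MAX_INSTRS];
    int choice[MAX_INSTRS];      /* partial solution being explored */
} Search;

/* Optimistic bound: add the highest-gain version of every undecided instruction. */
static double upper_bound(const Search *s, int next, double gain)
{
    for (int i = next; i < s->n_instrs; i++) {
        double m = s->instrs[i].v[0].gain;
        for (int k = 1; k < s->instrs[i].n_versions; k++)
            if (s->instrs[i].v[k].gain > m)
                m = s->instrs[i].v[k].gain;
        gain += m;
    }
    return gain;
}

static void explore(Search *s, int idx, double area, double gain)
{
    if (idx == s->n_instrs) {               /* all instructions analyzed */
        if (gain > s->best_gain) {          /* update best solution */
            s->best_gain = gain;
            for (int i = 0; i < s->n_instrs; i++)
                s->best_choice[i] = s->choice[i];
        }
        return;
    }
    for (int k = 0; k < s->instrs[idx].n_versions; k++) {  /* next version */
        double a = area + s->instrs[idx].v[k].area;
        double g = gain + s->instrs[idx].v[k].gain;
        if (a > s->max_area)
            continue;                        /* area constraint violated */
        if (upper_bound(s, idx + 1, g) <= s->best_gain)
            continue;                        /* bound cannot beat the best so far */
        s->choice[idx] = k;
        explore(s, idx + 1, a, g);           /* next instruction (recursive call) */
    }
}

/* Returns the best total gain, or -1.0 if no combination fits in max_area. */
double select_combination(const Instr *instrs, int n_instrs, double max_area,
                          int *best_choice)
{
    assert(n_instrs > 0 && n_instrs <= MAX_INSTRS);
    Search s = { instrs, n_instrs, max_area, -1.0, {0}, {0} };
    explore(&s, 0, 0.0, 0.0);
    for (int i = 0; i < n_instrs; i++)
        best_choice[i] = s.best_choice[i];
    return s.best_gain;
}
```

The upper-bound test lets the search skip whole subtrees whose most optimistic outcome cannot improve on the best combination already found, which keeps the recursion tractable when each instruction has several versions.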
Experimental Methodology
[Figure: experimental tool flow. Input: C Program. Automatic Custom Instruction Generation, using Aristotle, Xtensa TIE Compiler, Synopsys Design Compiler and Xtensa GNU Profiler, produces TIE and a Modified C program. Other boxes: Tensilica Processor Generator; Custom Processor (HDL Description); Synopsys Design Compiler; NEC CB11; Cross Compiler; ISS; Sente Wattwatcher. Measured outputs: Area, Clock Period, Execution Cycles, Power.]
Experimental Results (Contd.)
Average
  Performance improvement: 3.4X
  Energy reduction: 3.2X
  Energy*delay reduction: 12.6X
  Area increase: 1.8%
Conclusions
Automatic custom instruction synthesis for ASIPs
  Template generation/selection
  Custom instruction insertion
  Custom instruction combination selection
Experimental results
  3.4X average performance improvement
  12.6X average energy*delay reduction