Page 1: Synthesis of Custom Processors based on Extensible Platforms

Fei Sun+, Srivaths Ravi++, Anand Raghunathan++ and Niraj K. Jha+

+: Dept. of Electrical Engineering, Princeton University

++: NEC Laboratories America, Inc.

Page 2: Outline

SoC design constraints
Background
  Previous work in ASIP design
  Xtensa platform
  Manual custom instruction generation procedure
Automatic custom instruction generation flow
Experimental results
Conclusions

Page 3: SoC Design Constraints

Time to market
Cost
Performance
Power
Cost-performance trade-off
Flexibility
……

Page 4: Comparison of Different Approaches

                  ASIC  ASIP  GPP
Time to market    --    +     ++
Cost              ++    +     --
Performance       ++    +     --
Power             ++    +     --
Cost-performance  ++    +     --
Flexibility       --    +     ++

Legend: ++ very good, + good, -- very bad

Page 5: Flexibility vs. Energy Efficiency

[Figure: flexibility plotted against energy efficiency for four implementation styles]
General embedded processor (AMD-K6E): 0.1 - 1 MIPS/mW
Domain specific processor (DSP): 1 - 10 MIPS/mW
ASIP (Xtensa): 50 - 100 MIPS/mW
ASIC: 500 - 1000 MOPS/mW

Page 6: Previous Work in ASIP Design

ASIP architectures and overall design methodologies
  [Huang, 1994], [Adams, 1996], [Fisher, 1999], [Kucukcakar, 1999]
Application-specific instruction set selection
  [Choi, 1999], [Gschwind, 1999], [Arnold, 1999]
Low power ASIP design
  [Kalambur, 1997], [Dougherty, 1999], [Ishihara, 2000], [Sami, 2001]
Commercial offerings
  Xtensa, ARCtangent, Jazz, SP-5flex, Carmel

Page 7: Xtensa Architecture

[Figure: Xtensa architecture block diagram. Source: www.tensilica.com]
Blocks shown: Processor Controls; TRACE Port; JTAG Tap Control; On Chip Debug; Exception Support; Interrupt Control; Timers 1 to n; Data Address Watch 0 to n; Instruction Address Watch 0 to n; Align and Decode; Branch Logic & Instruction Fetch; Instruction Memory or Cache & Tags; Data Memory or Cache & Tags; Window Register File; ALU & Address Generation; MAC 16; Coprocessor Register File; Coprocessor Execution Units; Designer Defined Instruction Execution Unit; Special Function Register Access; Memory Protection Unit; Write Buffer; Processor Interface (Instruction, Data, Instruction Address, Data Address)
Legend: Base ISA Feature; Configurable Function; Optional Function; Configurable & Optional Function; Extensible

Page 8: Xtensa Processor Design Flow

[Figure: Xtensa processor design flow. Source: www.tensilica.com]
Inputs: Processor Configuration Inputs; Designer-Defined Instruction Descriptions; Configuration File
Generator outputs: Configured GNU C/C++ Compiler; Configured GNU Assembler/Disassembler; Configured Instruction Set Simulator/Emulator; Configured Processor HDL
Hardware side: Area, Power and Timing Estimation; Logic Synthesis (Synopsys or Ambit); Block Place/Route (Avant! or Cadence); Timing Verification; Hardware Profile; Optimized Hardware
Software side: Application Source Code; Sample Application Data; Application Specific Compile, Assemble, Link; Application Simulation with ISS and/or Emulator; Software Debugging/Profiling; Optimized Software
Legend: Generator Output; Internal Database; Design data; Use of Generated Data

Page 9: Manual Custom Instruction Generation Procedure

Steps:
  Identify potential new instructions
  Describe custom instructions
  Insert custom instructions
  Verify functional correctness
Manual effort involved:
  Profile, read source code
  Understand source code
  Rewrite source code
Slow and error-prone

Page 10: Contributions of Our Work

Automatic custom instruction selection
  Application program to extensible processors with custom instructions
Features
  Efficient design space search
  Use accurate information from instruction set simulator and synthesis
  Bridge the gap between automatically synthesized and manually designed architectures

Page 11: Automatic Custom Instruction Generation Flow

[Figure: flow chart with steps numbered 1 to 20]
Input: Application program (C)
Boxes: Profile C program; Generate program dependence graphs; Rank control blocks; Generate templates; Select templates; Generate individual custom instr (steps 6 - 13); Select custom instr combination; Generate custom instr combination; Synthesize custom instr combination; Clock period/area constraints met? (Y/N, with N leading to Next instr combination); Profile C with instr combination; Build processor; Synthesize processor (step 20)
Tools: Aristotle analysis system; Profiler (xt-gprof)

Page 12: Automatic Custom Instruction Generation Flow

[Figure: the flow chart from Page 11, repeated]

Page 13: Example Illustration of Template Generation

c = a & 0xff;      // node 1
d = b & 0xff + c;  // node 2
e = d << 24;       // node 3
g = f & 0xff00;    // node 4

[Figure: dependence graph with nodes 1 to 4 over variables a, b, c, d, e, f, g; weights shown are 0.03, 0.03, 0.03 and 0.06]

Page 14: Example Illustration of Template Generation

[Figure: the code fragment and dependence graph from Page 13, repeated]

Page 15: Example Illustration of Template Generation

[Figure: the code fragment and dependence graph from Page 13]
Basic templates: {1}, {2}, {3}, {4}

Page 16: Example Illustration of Template Generation

[Figure: the code fragment and dependence graph from Page 13]
Basic templates: {1}, {2}, {3}, {4}
Dependent templates: {1,2}, {2,3}, {1,2,3}

Page 17: Example Illustration of Template Generation

[Figure: the code fragment and dependence graph from Page 13]
Basic templates: {1}, {2}, {3}, {4}
Dependent templates: {1,2}, {2,3}, {1,2,3}
Independent templates: {1,4}, {2,4}, {3,4}, {1,2,4}, {2,3,4}, {1,2,3,4}
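The template sets listed for this example can be reproduced with a small enumeration. The sketch below is only an illustration: the edge list (node 1 feeds node 2 through c, node 2 feeds node 3 through d) is read off the code fragment, while the convexity rule that excludes {1,3} and {1,3,4} is an assumption inferred from the sets on the slide, not a rule the slides state. The names `classify` and `count_templates` are hypothetical.

```c
#include <stdbool.h>

#define NODES 4 /* node k of the slide is stored as bit k-1 of a mask */

enum template_kind { INVALID, BASIC, DEPENDENT, INDEPENDENT };

/* uses[i][j]: node j+1 reads a value written by node i+1 (from the slide's code). */
static const bool uses[NODES][NODES] = {
    {false, true,  false, false},  /* node 1 -> node 2 via c */
    {false, false, true,  false},  /* node 2 -> node 3 via d */
    {false, false, false, false},
    {false, false, false, false},  /* node 4 is unrelated to the others */
};

static bool in_set(unsigned mask, int node) { return (mask >> node) & 1u; }

/* Assumed rule: no path may leave the set and re-enter it. */
static bool is_convex(unsigned mask) {
    bool reach[NODES][NODES];
    for (int i = 0; i < NODES; i++)
        for (int j = 0; j < NODES; j++)
            reach[i][j] = uses[i][j];
    for (int k = 0; k < NODES; k++)
        for (int i = 0; i < NODES; i++)
            for (int j = 0; j < NODES; j++)
                if (reach[i][k] && reach[k][j]) reach[i][j] = true;
    for (int k = 0; k < NODES; k++) {
        if (in_set(mask, k)) continue;
        for (int i = 0; i < NODES; i++)
            for (int j = 0; j < NODES; j++)
                if (in_set(mask, i) && in_set(mask, j) && reach[i][k] && reach[k][j])
                    return false;
    }
    return true;
}

/* True when the chosen nodes form one piece, ignoring edge direction. */
static bool is_connected(unsigned mask) {
    unsigned seen = mask & (~mask + 1u); /* start from the lowest chosen node */
    bool grew = true;
    while (grew) {
        grew = false;
        for (int i = 0; i < NODES; i++) {
            if (!in_set(seen, i)) continue;
            for (int j = 0; j < NODES; j++) {
                if (in_set(mask, j) && !in_set(seen, j) && (uses[i][j] || uses[j][i])) {
                    seen |= 1u << j;
                    grew = true;
                }
            }
        }
    }
    return seen == mask;
}

enum template_kind classify(unsigned mask) {
    if (mask == 0 || mask >= (1u << NODES)) return INVALID;
    if ((mask & (mask - 1u)) == 0) return BASIC; /* exactly one node */
    if (!is_convex(mask)) return INVALID;
    return is_connected(mask) ? DEPENDENT : INDEPENDENT;
}

int count_templates(enum template_kind kind) {
    int count = 0;
    for (unsigned mask = 1; mask < (1u << NODES); mask++)
        if (classify(mask) == kind) count++;
    return count;
}
```

Under these assumptions the enumeration yields 4 basic, 3 dependent and 6 independent templates, matching the sets shown for the example.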

Page 18: Key Observations for Pruning

The higher the weight of a template, the higher the potential for improvement (Amdahl's law)

Scope for optimization is determined by computation: the number of cycles needed to execute the template

Scope for optimization is determined by the read/write port limitation: additional cycles are needed for extra reading/writing of input/output variables
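The first observation can be made concrete with Amdahl's law. The sketch below is a generic illustration, not code from the presented tool: `weight` stands for the fraction of execution time spent in a template and `local_speedup` for how much faster the template runs as a custom instruction; both function names are hypothetical.

```c
/* Overall program speedup when a fraction `weight` of the run time
   is accelerated by a factor `local_speedup` (Amdahl's law). */
double amdahl_speedup(double weight, double local_speedup) {
    return 1.0 / ((1.0 - weight) + weight / local_speedup);
}

/* Upper limit on the overall speedup, reached as the accelerated
   part approaches zero execution time. Requires weight < 1. */
double amdahl_limit(double weight) {
    return 1.0 / (1.0 - weight);
}
```

For instance, a template with weight 0.5 that runs twice as fast gives an overall speedup of about 1.33, while a template with weight 0.06 can never yield more than about 1.064 regardless of how fast it becomes, which is why low-weight templates are natural candidates for pruning.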

Page 19: Pruning Algorithm

Ranking criterion (the formula itself is not preserved in this transcript):
  OriginalTime: fraction of the total execution time of the original program spent in the template (weight)
  In, Out: number of inputs and outputs of the template, respectively
  α, β: number of inputs/outputs encoded in the instruction
  γ: number of cycles needed for executing the template

Higher priority means greater potential for speedup
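To illustrate how the quantities In, Out, α, β and γ could interact, here is a minimal cost sketch. It is not the ranking formula from the slide, which did not survive in the transcript. It assumes, purely for illustration, that every input beyond the α encoded in the instruction and every output beyond the β encoded costs one extra cycle for a user-register transfer. The function name `estimated_cycles` is hypothetical.

```c
/* Returns x when positive, otherwise 0. */
static int positive_part(int x) { return x > 0 ? x : 0; }

/* Illustrative cycle estimate for one custom instruction.
   Assumption (not from the slides): each operand that does not fit
   in the instruction encoding costs exactly one additional cycle. */
int estimated_cycles(int in, int out, int alpha, int beta, int gamma) {
    int extra_reads  = positive_part(in - alpha);  /* inputs moved via user registers  */
    int extra_writes = positive_part(out - beta);  /* outputs moved via user registers */
    return gamma + extra_reads + extra_writes;
}
```

With α = 2 and β = 1, a single-cycle template with four inputs and two outputs would be charged 1 + 2 + 1 = 4 cycles under this model, which shows how port limits can erode the benefit of a template even when its computation is short.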

Page 20: Synthesis of Custom Processors based on Extensible Platforms Fei Sun +, Srivaths Ravi ++, Anand Raghunathan ++ and Niraj K. Jha + + : Dept. of Electrical

12.7312.73

12.73

Template Generation with Pruning

10.51

7.92

4.05

2.13

Ranked pool of seed templates 12.73

Highest priority

5.36 1.18 16.35

Threshold: 0.1

Template set

Page 21: Template Generation with Pruning

[Figure: next step of the pruning example; highest priority 12.73; candidates 5.36, 1.18 and 16.35; pool now shows 10.51, 7.92, 5.36, 4.05 and 2.13; threshold: 0.1; template set; ranked pool of seed templates]

Page 22: Template Generation with Pruning

[Figure: next step of the pruning example; highest priority 12.73; candidates 1.18 and 16.35, with 1.18 singled out; pool shows 10.51, 7.92, 5.36, 4.05 and 2.13; threshold: 0.1; template set; ranked pool of seed templates]

Page 23: Template Generation with Pruning

[Figure: final step of the pruning example; candidate 16.35 joins the templates alongside 12.73; pool shows 10.51, 7.92, 5.36, 4.05 and 2.13; threshold: 0.1; template set; ranked pool of seed templates]
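One plausible reading of the threshold in this example can be sketched as follows. The slides give the threshold value (0.1) and the priorities but do not spell out the comparison, so the rule used here, keeping a candidate only when its priority is at least the threshold ratio times the highest seed priority, is an assumption. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed rule: a candidate survives when its priority is at least
   threshold_ratio times the highest priority in the seed pool. */
bool survives_pruning(double priority, double highest_priority, double threshold_ratio) {
    return priority >= threshold_ratio * highest_priority;
}

/* Copies the surviving candidates into `kept` (in order) and returns how many survived.
   `kept` must have room for at least n entries. */
size_t prune_candidates(const double *candidates, size_t n,
                        double highest_priority, double threshold_ratio,
                        double *kept) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (survives_pruning(candidates[i], highest_priority, threshold_ratio))
            kept[count++] = candidates[i];
    }
    return count;
}
```

Applied to the numbers on the slides, the cutoff would be 0.1 × 12.73 = 1.273, so the candidates with priorities 5.36 and 16.35 are kept and the one with priority 1.18 is dropped.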

Page 24: No. of Templates vs. Threshold Ratio

[Figure: plot of the number of templates against the threshold ratio]

Page 25: Automatic Custom Instruction Generation Flow

[Figure: the flow chart from Page 11, repeated]

Page 26: Automatic Custom Instruction Generation Flow (Contd.)

[Figure: detailed flow chart for "Generate individual custom instr", steps 6 - 13, entered from "Select templates" (step 5)]
Boxes: Extract templates; Generate custom instr; Generate RTL Verilog; Synthesize Verilog; Clock period constraint met? (Y/N, with N leading to "Increase number of cycles or increase clock period"); Insert custom instr; Profile C with custom instr; All templates built? (Y/N, with N leading to Next template)
Tools: TIE compiler; Synopsys Design Compiler

Page 27: Automatic Custom Instruction Generation Flow (Contd.)

[Figure: the flow chart from Page 26, repeated]

Page 28: Custom Instruction Insertion

Care must be taken to insert custom instructions at appropriate places without affecting the program's functional correctness

If custom instructions need extra inputs (outputs), care must be taken to select appropriate variables to write to (read from) user-defined registers

Page 29: Example Illustration of Custom Instruction Insertion

(a) Original code, dependence graph nodes 1, 2, 3, 4, 5:

t = s >> 24;    // 1
r = t & 0xff;   // 2
a[5] = t + d;   // 3
m = b[0];       // 4
y = x + m;      // 5

(b) After insertion, dependence graph nodes 3, 4 and the merged node 1,2,5:

m = b[0];               // 4
y = CustomInstr(s,m);   // 1,2,5
t = RUR(0);             // 1,2,5
a[5] = t + d;           // 3

Page 30: Example Illustration of Custom Instruction Insertion (Contd.)

(a) Original code:

....
offset = t + 1;
for (i=0; i<100; i++){
    j = ....
    result = offset + i * j;
}
....

(b) After insertion, with offset written to a user-defined register before the loop:

....
offset = t + 1;
WUR(offset,0);
for (i=0; i<100; i++){
    j = ....
    result = CustomInstr(i,j);
}
....
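The loop example can be checked for functional equivalence with a plain-C emulation. In the sketch below, `WUR` and `CustomInstr` are ordinary C stand-ins written for illustration, not the Tensilica intrinsics; the single emulated user register and the choice `j = i + 3` (the slide elides how j is computed) are assumptions.

```c
/* Emulated user-defined register file with a single register. */
static int user_regs[1];

/* Stand-in for writing a value to user register `index`. */
void WUR(int value, int index) { user_regs[index] = value; }

/* Stand-in for the custom instruction: reads the loop-invariant
   offset from user register 0 instead of taking it as an operand. */
int CustomInstr(int i, int j) { return user_regs[0] + i * j; }

/* Version (a): the original loop. */
int loop_original(int t) {
    int offset = t + 1;
    int result = 0;
    for (int i = 0; i < 100; i++) {
        int j = i + 3; /* hypothetical: the slide elides this computation */
        result = offset + i * j;
    }
    return result;
}

/* Version (b): offset is written once before the loop, so the custom
   instruction needs only two register-file inputs per iteration. */
int loop_custom(int t) {
    int offset = t + 1;
    int result = 0;
    WUR(offset, 0);
    for (int i = 0; i < 100; i++) {
        int j = i + 3; /* same hypothetical computation as above */
        result = CustomInstr(i, j);
    }
    return result;
}
```

Because offset does not change inside the loop, hoisting the register write out of it preserves the result while keeping the per-iteration operand count within the instruction encoding.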

Page 31: Automatic Custom Instruction Generation Flow

[Figure: the flow chart from Page 11, repeated]

Page 32: Custom Instruction Combination Selection: Problem Statement

Given a set of non-overlapping custom instructions, with each instruction having several versions, find a version for each instruction such that performance is maximized while area is under a certain threshold

Page 33: Custom Instruction Combination Selection: Flow Chart

[Figure: flow chart of the recursive selection procedure]
Boxes: Start; All instrs analyzed? (Y/N); Add current version of current instr to solution; Area meets constraint? (Y/N); Performance upper bound is among the best? (Y/N); Next instruction (recursive call); All versions considered? (Y/N); Next version; Performance is among the best? (Y/N); Update best solutions; Stop
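The boxes of the flow chart suggest a recursive search with an area check and a performance upper bound. The sketch below is one way such a search could look, under stated assumptions: the version table (area and cycles saved) is made-up illustrative data, version 0 of every instruction means "not used", only a single best solution is tracked rather than a set of best solutions, and the bound simply adds the largest possible gain of each remaining instruction. All names are hypothetical.

```c
#define NUM_INSTRS 3
#define NUM_VERSIONS 3

struct version { int area; int gain; }; /* hypothetical units: area and cycles saved */

/* Illustrative data only; version 0 means the instruction is left out. */
static const struct version table[NUM_INSTRS][NUM_VERSIONS] = {
    {{0, 0}, {40, 100}, {90, 160}},
    {{0, 0}, {30, 50},  {60, 120}},
    {{0, 0}, {50, 70},  {80, 90}},
};

int best_gain;
int best_choice[NUM_INSTRS];
static int choice[NUM_INSTRS];

/* Largest gain any version of `instr` can contribute, ignoring area. */
static int max_gain_of(int instr) {
    int best = 0;
    for (int v = 0; v < NUM_VERSIONS; v++)
        if (table[instr][v].gain > best) best = table[instr][v].gain;
    return best;
}

static void search(int instr, int area_used, int gain, int area_limit) {
    if (instr == NUM_INSTRS) {              /* all instructions analyzed */
        if (gain > best_gain) {             /* update best solution */
            best_gain = gain;
            for (int k = 0; k < NUM_INSTRS; k++) best_choice[k] = choice[k];
        }
        return;
    }
    for (int v = 0; v < NUM_VERSIONS; v++) { /* next version */
        int area = area_used + table[instr][v].area;
        if (area > area_limit) continue;     /* area constraint violated */
        int new_gain = gain + table[instr][v].gain;
        int bound = new_gain;                /* optimistic performance bound */
        for (int k = instr + 1; k < NUM_INSTRS; k++) bound += max_gain_of(k);
        if (bound <= best_gain) continue;    /* cannot beat the best so far */
        choice[instr] = v;
        search(instr + 1, area, new_gain, area_limit); /* next instruction */
    }
}

/* Returns the best total gain achievable within `area_limit`;
   the chosen version per instruction is left in best_choice. */
int select_combination(int area_limit) {
    best_gain = -1;
    search(0, 0, 0, area_limit);
    return best_gain;
}
```

With an area limit of 100 on this data, the search selects version 1 of the first instruction and version 2 of the second for a total gain of 220, pruning branches whose optimistic bound cannot exceed the best solution already found.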

Page 34: Automatic Custom Instruction Generation Flow

[Figure: the flow chart from Page 11, repeated]

Page 35: Experimental Methodology

[Figure: experimental tool flow]
Input: C Program
Automatic Custom Instruction Generation, using: Aristotle; Xtensa TIE Compiler; Synopsys Design Compiler; Xtensa GNU Profiler
Intermediate results: TIE; Modified C program
Hardware side: Tensilica Processor Generator; Custom Processor (HDL Description); Synopsys Design Compiler; NEC CB11
Software side: Cross Compiler; ISS; Sente Wattwatcher
Reported metrics: Area; Clock Period; Execution Cycles; Power

Page 36: Experimental Results (Contd.)

Average
  Performance improvement: 3.4X
  Energy reduction: 3.2X
  Energy*delay reduction: 12.6X
  Area increase: 1.8%

Page 37: Conclusions

Automatic custom instruction synthesis for ASIPs
  Template generation/selection
  Custom instruction insertion
  Custom instruction combination selection
Experimental results
  3.4X average performance improvement
  12.6X average energy*delay reduction