Design of Application Specific Processor Architectures

Rainer Leupers
RWTH Aachen University
Institute for Integrated Signal Processing Systems
Software for Systems on Silicon
[email protected]



2005 © R. Leupers

Overview: Geography

Berlin

Aachen

About ISS

• RWTH Aachen is a top-ranked technical university in Germany
• ISS institute
  – 3 professors (Meyr, Ascheid, Leupers)
  – 18 Ph.D. students
  – 5 staff
• Research on wireless communication systems
• Tight industry cooperations
• Origin of several EDA spin-off companies (e.g. Cadis, Axys, LISATek)
• SSS group focuses on embedded processor design tools

Overview

1. Introduction
2. ASIP design methodologies
3. Software tools
4. ASIP architecture design
5. Case study
6. Advanced research topics


1. Introduction

Embedded system design automation

• Embedded systems
  – Special-purpose electronic devices
  – Very different from desktop computers
• Strength of European IT market
  – Telecom, consumer, automotive, medical, ...
  – Siemens, Nokia, Bosch, Infineon, ...
• New design requirements
  – Low NRE cost, high efficiency requirements
  – Real-time operation, dependability
  – Keep pace with Moore's Law


What to do with chip area ?

Example: wireless multimedia terminals

• Multistandard radio: UMTS, GSM/GPRS/EDGE, WLAN, Bluetooth, UWB, …
• Multimedia standards: MPEG-4, MP3, AAC, GPS, DVB-H, …

Key issues:

• Time to market (≤ 12 months)
• Flexibility (ongoing standard updates)
• Efficiency (battery operation)

Application specific processors (ASIPs)

"As the performance of conventional microprocessors improves, they first meet and then exceed the requirements of most computing applications. Initially, performance is key. But eventually, other factors, like customization, become more important to the customer..."

[M.J. Bass, C.M. Christensen: The Future of the Microprocessor Business, IEEE Spectrum 2002]

design budget = (semiconductor revenue) × (% for R&D)
    semiconductor revenue: growth ≈ 15%; % for R&D: ≈ 10%

# IC designs = (design budget) / (design cost per IC)
    design budget: growth ≈ 15%; design cost per IC: growth ≈ 50-100%

[Keutzer05]

→ Customizable application specific processors as reusable, programmable platforms

Efficiency and flexibility

Source: T. Noll, RWTH Aachen

[Figure: design styles plotted by log flexibility against log performance and log power dissipation. Labeled regions: general purpose processors, digital signal processors, application specific instruction set processors, field programmable devices, application specific ICs, physically optimized ICs. Annotations: HW design, SW design; scale marks 10^3 ... 10^4 and 10^5 ... 10^6.]

Why use ASIPs?
• Higher efficiency for given range of applications
• IP protection
• Cost reduction (no royalties)
• Product differentiation


Standard-CPU vs. ASIP

2. ASIP design methodologies

ASIP architecture exploration

[Figure: exploration loop leading from an initial processor architecture to an optimized processor architecture. Each iteration runs the application through compiler, assembler, linker, simulator and profiler.]


Expression (UC Irvine)


Tensilica Xtensa/XPRES

Source: Tensilica Inc.

MIPS CorExtend/CoWare CorXpert

1. Profile and identify custom instructions (hotspot)
2. Replace critical code with special instruction (user defined instruction)
3. Synthesize HW and profile with MIPSsim and extensions

[Figure: user defined instructions are added to the processor through the CorExtend module]

CoWare LISATek ASIP architecture exploration

• Integrated embedded processor development environment
• Unified processor model in LISA 2.0 architecture description language (ADL)
• Automatic generation of:
  – SW tools
  – HW models

LISA operation hierarchy

[Figure: operation tree. Levels from root to leaves: main; decode; addr, cond, opcode, opnds; imm, linear, cycl, control, arithm, move, short, long; add, sub, mul, and, or]

Reflects hierarchical organization of ISAs

LISA operations structure

Sections of a LISA operation:

• BEHAVIOR: computation and processor state update
• SYNTAX: assembly syntax
• CODING: binary coding
• DECLARE: references to other operations
• EXPRESSION: resource access, e.g. registers
• ACTIVATION: initiate "downstream" operations in pipe
• SEMANTICS: C compiler generation

LISA operation example

    OPERATION ADD
    {
        DECLARE  { GROUP src1, src2, dest = { Register } }
        CODING   { 0b1011 src1 src2 dest }
        SYNTAX   { "ADD" dest "," src1 "," src2 }
        BEHAVIOR { dest = src1 + src2; }    /* C/C++ code */
    }

    OPERATION Register
    {
        DECLARE    { LABEL index; }
        CODING     { index }
        SYNTAX     { "R" index }
        EXPRESSION { R[index] }
    }

[Figure: operation ADD references three Register operations: src1, src2, dest]
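To make the roles of CODING and BEHAVIOR concrete, here is a minimal hand-written C sketch of what a decoder and behavior routine for this ADD operation might look like. It is not output of the LISA tools. The slide does not give the width of the `index` label, so the 4-bit register fields, the resulting 16-bit instruction word, the 16-entry register file and the names `cpu_state` and `execute_add` are all assumptions made for illustration.

```c
#include <stdint.h>

/* Assumption: 4-bit opcode followed by three 4-bit register indices. */
enum { NUM_REGS = 16, OPCODE_ADD = 0xB /* 0b1011, from the CODING section */ };

typedef struct {
    uint32_t R[NUM_REGS];   /* register file reached through R[index] */
} cpu_state;

/* Decode one instruction word and, if it is an ADD, run its behavior.
 * Returns 1 when the word matched the ADD coding, 0 otherwise. */
int execute_add(cpu_state *cpu, uint16_t word)
{
    unsigned opcode = (word >> 12) & 0xFu;   /* CODING: 0b1011 */
    unsigned src1   = (word >> 8)  & 0xFu;   /* CODING: src1   */
    unsigned src2   = (word >> 4)  & 0xFu;   /* CODING: src2   */
    unsigned dest   =  word        & 0xFu;   /* CODING: dest   */

    if (opcode != (unsigned)OPCODE_ADD)
        return 0;

    /* BEHAVIOR: dest = src1 + src2; */
    cpu->R[dest] = cpu->R[src1] + cpu->R[src2];
    return 1;
}
```

With this assumed encoding, the word 0xB123 corresponds to "ADD R3, R1, R2" in the SYNTAX given above.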

Exploration/debugger GUI

• Application simulation
• Debugging
• Profiling
• Resource utilization analysis
• Pipeline analysis
• Processor model debugging
• Memory hierarchy exploration
• Code coverage analysis
• ...

Some available LISA 2.0 models

• DSP:
  – Texas Instruments TMS320C54x
  – Analog Devices ADSP21xx
  – Motorola 56000
• RISC:
  – MIPS32 4K
  – ESA LEON SPARC 8
  – ARM7100
  – ARM926
• VLIW:
  – Texas Instruments TMS320C6x
  – STMicroelectronics ST220
• µC:
  – MHS 80C51
• ASIP:
  – Infineon PP32 NPU
  – Infineon ICore
  – MorphICs DSP


3. Software tools

Tools generated from processor ADL model

[Figure: tool chain generated from the ADL model: compiler, assembler, linker, simulator, profiler; input: application]

Instruction set simulation

Interpretive:
• flexible
• slow (~ 100 KIPS)
[Figure: at run-time, each instruction of the application is read from memory, decoded and executed]

Compiled:
• fast (> 10 MIPS)
• inflexible
• high memory consumption
[Figure: at compile-time, a simulation compiler translates the application into instruction behaviors held in program memory; at run-time, these are only executed]

JIT-CCS™:
• "just-in-time" compiled
• SW simulation cache
• fast and flexible
[Figure: at run-time, instructions from program memory are decoded, their instruction behaviors are kept in the compiled simulation cache, and execution reuses them]
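The simulation cache idea can be sketched in a few lines of C. This is an illustrative toy, not the JIT-CCS implementation: the two-instruction ISA (high byte opcode, low byte operand), the direct-mapped cache organization, the cache size of 256 records and all names (`sim_cache`, `sim_step`, `decode`) are assumptions. What it shows is the principle from the slide: an instruction is decoded on its first execution, and later executions at the same address reuse the cached record.

```c
#include <stdint.h>

#define CACHE_RECORDS 256u   /* assumed size, chosen only for this sketch */

typedef void (*behavior_fn)(uint32_t *acc, uint8_t operand);

/* Behavior routines of a hypothetical accumulator machine. */
static void beh_add(uint32_t *acc, uint8_t operand) { *acc += operand; }
static void beh_sub(uint32_t *acc, uint8_t operand) { *acc -= operand; }
static void beh_nop(uint32_t *acc, uint8_t operand) { (void)acc; (void)operand; }

typedef struct {
    int         valid;      /* record holds a decoded instruction       */
    uint32_t    addr;       /* program address the record belongs to    */
    behavior_fn behavior;   /* pre-selected behavior routine            */
    uint8_t     operand;    /* pre-extracted operand                    */
} cache_record;

typedef struct {
    cache_record  records[CACHE_RECORDS];
    unsigned long hits, misses;
} sim_cache;

/* Decode step of the toy ISA: high byte = opcode, low byte = operand. */
static void decode(uint16_t word, cache_record *rec)
{
    switch (word >> 8) {
    case 0:  rec->behavior = beh_add; break;
    case 1:  rec->behavior = beh_sub; break;
    default: rec->behavior = beh_nop; break;
    }
    rec->operand = (uint8_t)(word & 0xFFu);
}

/* Simulate the instruction at address pc, decoding only on a cache miss. */
void sim_step(sim_cache *cache, const uint16_t *prog_mem, uint32_t pc, uint32_t *acc)
{
    cache_record *rec = &cache->records[pc % CACHE_RECORDS];

    if (rec->valid && rec->addr == pc) {
        cache->hits++;                    /* reuse the decoded record */
    } else {
        decode(prog_mem[pc], rec);        /* "just-in-time" decode    */
        rec->addr  = pc;
        rec->valid = 1;
        cache->misses++;
    }
    rec->behavior(acc, rec->operand);     /* execute */
}
```

In a loop, only the first pass over each instruction pays the decode cost. A sketch like this would also need to invalidate records when program memory is written, which is left out here.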

JIT-CC simulation performance

[Figure: performance [MIPS] (axis 0 to 9) and cache miss ratio [%] (axis 0 to 100) versus cache size [records] (8 to 32768), with compiled and interpretive simulation shown for comparison]

• Dependent on simulation cache size
• 95% of compiled simulation performance @ 4096 cache blocks (10% memory consumption of compiled sim.)
• Example: ST200 VLIW DSP

Why care about C compilers?

• Embedded SW design becoming predominant manpower factor in system design
• Cannot develop/maintain millions of code lines in assembly language
• Move to high-level programming languages

Why care about compilers?

• Trend towards heterogeneous multiprocessor systems-on-chip (MPSoC)
• Customized application specific instruction set processors (ASIPs) are key MPSoC components
• How to achieve efficient compiler support for ASIPs?

[Figure: MPSoC composed of ASIC, CPU, ASIP and memory blocks]

C compiler in the exploration loop

"Compiler/Architecture Co-Design"

• Efficient C-compilers cannot be designed for ARBITRARY architectures!
• Compiler and processor form a UNIT that needs to be optimized!
• "Compiler-friendliness" needs to be taken into account during the architecture exploration!

[Figure: application software → compiler → processor → results]

Retargetable compilers

[Figure: a classical compiler translates source code to asm code with the processor model built into the compiler; a retargetable compiler takes the processor model as a separate input]


GNU C compiler (gcc)

• Probably the most widespread retargetable compiler

• Mostly used as a native Unix/Linux compiler, but may operate as a cross-compiler, too

• Support for C/C++, Java, and other languages

• Comes with comprehensive support software, e.g. runtime and standard libraries, debug support

• Portable to new architectures by means of machine description file and C support routines

"The main goal of GCC was to make a good, fast compiler for machines in the class that the GNU system aims to run on: 32-bit machines that address 8-bit bytes and have several general registers. Elegance, theoretical power and simplicity are only secondary."

LANCE (Univ. of Dortmund/RWTH Aachen)

• Modular retargetable C compiler system
• www.lancecompiler.com

[Figure: the LANCE library (lance2.h header file, liblance2.a C++ library) is used by the LANCE tools: C frontend, IR optimization 1 ... IR optimization n, backend interface, all operating on IR-C]

Executable C based IR in the LANCE compiler

C source code:

    void f()
    {
        int i, A[10];
        i = A[2]++ > 1 ? 2 : 3;
    }

IR, symbol table:

    int A[10];
    char *t1, *t3;
    int i, t2, t5, t6, t7, t8;
    int *t4;

IR, array access:

    t3 = (char *)A;     // cast base to char*
    t2 = 2 * 4;         // compute offset
    t1 = t3 + t2;       // compute eff addr
    t4 = (int *)t1;     // cast back to int*
    t5 = *t4;           // load value from memory

IR, increment:

    t6 = t5 + 1;        // increment
    *t4 = t6;           // store back into A[2]

IR, conditional expression:

    t7 = t5 > 1;        // compare
    if (t7) goto L1;    // jump if >
    t8 = 3;             // load 3 if <=
    goto L2;            // goto join point
    L1: t8 = 2;         // load 2 if >
    L2: i = t8;         // move result into i


CoSy compiler system (ACE)

© ACE - Associated Compiler Experts

• Universal retargetable C/C++ compiler

• Extensible intermediate representation (IR)

• Modular compiler organization

• Generator (BEG) for code selector, register allocator, scheduler

ACE CoSy system structure

[Figure: compiler structure from .c input to .asm output: parser engine, IR, optimizations, and architecture specific backend engines (lower, datagen, sched, match, gra, emit); BEG generates backend components]

Description files:
• .sdl: IR description
• .edl: control flow through compiler
• .cgd: code generator description
• .tdf: architecture parameters

LISATek C compiler generation

[Figure: LISA processor model → autom. analyses and manual refinement (GUI) → CoSy system → C compiler]

    SYNTAX    { "ADD" dst, src1, src2 }
    CODING    { 0b0010 dst src1 src2 }
    BEHAVIOR  {
        ALU_read (src1, src2);
        ALU_add ();
        Update_flags ();
        writeback (dst);
    }
    SEMANTICS { src1 + src2 → dst; }

LISATek compiler generation

Compiler phases: Frontend → Opt → Backend (code selector, register allocator, scheduler)

C-Code:

    int a,b,c;
    a = b+1;
    c = a<<3;
    …

ASM-Code:

    LD R1, [R2]
    ADD R1, #1
    SHL R1, #3
    …

[Figure: processor model feeding the backend components: pipeline stages FE, DE, EX, WB (instruction fetch, decoder, ALU, write-back), pipeline control, jump, registers R[0..31], Mem, DataRAM, ProgRAM; instruction fragments ADD, R[i], #1; scheduler table over JMP, ADD, SUB, MUL with entries JMP 2 1 and ADD 2 3]

Compiled code quality: MIPS example

LISATek generated C-Compiler
• Out-of-the-box C-Compiler
• No manual optimizations
• Development time of model approx. 2 weeks

gcc C-Compiler
• gcc with MIPS32 4kc backend
• Used by most MIPS users
• Large group of developers, several man-years of optimization

[Figure: two bar charts comparing gcc,-O4, gcc,-O2, cosy,-O4 and cosy,-O2: cycles (axis 0 to 140.000.000) and size (axis 0 to 80.000)]

Overhead of 10% in cycle count and 17% in code density

Demands on code quality

Compilers for embedded processors have to generate extremely efficient code

• Code size:
  » system-on-chip
  » on-chip RAM/ROM
• Performance:
  » real-time constraints
• Power/energy consumption:
  » heat dissipation
  » battery lifetime

Compiler flexibility/code quality trade-off

[Figure: the variety of embedded processors (DSP, NPU, VLIW) calls for specialization through dedicated optimization techniques, while retargetable compilation aims at unification]

Adding processor-specific code optimizations

• High-level (compiler IR): enabled by CoSy's engine concept
• Low-level (ASM): assembly API

[Figure: .C → LISA C compiler → unscheduled .asm → optimization 1, optimization 2, optimization 3 (assembly API) → scheduled & optimized .asm → binary code generation: assembler → linker → .out]

Embedded processors are not compiler-friendly

• Designed for efficiency
• E.g. fixed-point DSP data paths: special purpose registers, constrained parallelism, ...
• Challenge for compilers that usually prefer orthogonal "compiler-friendly" architectures

[Figure: data paths of two fixed-point DSPs. TI C25: MEM, TR, mult, PR, ALU, ACCU. ADSP210x: multiplier (*, +, -) with registers MX, MY, MR, MF; ALU (+, -) with registers AX, AY, AR, AF; DP]

DSP address code optimization

• Based on address generation unit (AGU) support
• Address arithmetic in parallel to central data path
• Support for auto-increment and auto-modify

Examples:
• TI C2x/C5x
• Motorola 56000
• ADSP-210x
• AMS Gepard core

[Figure: AGU with address registers, modify registers, a +/- unit fed by the constant 1 or an instruction field, addressing MEM]

DSP address code optimization

variable set: { a, b, c, d }
access sequence: b, d, a, c, d, a, c, b, a, d, a, c, d

access graph (edge weight = number of times two variables are accessed consecutively):
    a–d: 4, a–c: 3, c–d: 2, a–b: 1, b–c: 1, b–d: 1

max. Hamiltonian path: c–a–d–b (weights 3, 4, 1)

alphabetic layout: a, b, c, d at addresses 0, 1, 2, 3
    AR = 1, AR += 2, AR -= 3, AR += 2, AR++, AR -= 3, AR += 2, AR--, AR--, AR += 3, AR -= 3, AR += 2, AR++
    cost: 9

optimized layout: c, a, d, b at addresses 0, 1, 2, 3
    AR = 3, AR--, AR--, AR--, AR += 2, AR--, AR--, AR += 3, AR -= 2, AR++, AR--, AR--, AR += 2
    cost: 5

www.address-code-optimization.org
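The two cost figures can be reproduced with a small C function. The cost model below is inferred from the listings on the slide rather than stated there: one instruction to initialise AR, plus one extra instruction for every access whose address distance from the previous access exceeds the auto-increment/decrement range of 1. The function name `agu_cost` and the encoding of variables as single characters are assumptions for this sketch.

```c
#include <stdlib.h>
#include <string.h>

/* Address code cost of an access sequence for a given memory layout.
 * layout[i] is the variable stored at address i, e.g. "abcd" or "cadb".
 * Returns -1 if the sequence uses a variable missing from the layout. */
int agu_cost(const char *access_seq, const char *layout)
{
    int cost = 0;
    int prev = -1;                       /* address of the previous access */

    for (const char *p = access_seq; *p != '\0'; ++p) {
        const char *pos = strchr(layout, *p);
        if (pos == NULL)
            return -1;
        int addr = (int)(pos - layout);

        if (prev < 0)
            cost++;                      /* initial "AR = addr"            */
        else if (abs(addr - prev) > 1)
            cost++;                      /* "AR += k" / "AR -= k", k > 1   */
        /* distance 0 or 1 is covered by free auto-increment/decrement     */

        prev = addr;
    }
    return cost;
}
```

Calling `agu_cost("bdacdacbadacd", "abcd")` yields 9 and `agu_cost("bdacdacbadacd", "cadb")` yields 5, matching the slide. The optimized layout follows the maximum-weight Hamiltonian path because every access graph edge on the path becomes a free unit step.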

Source-level code optimization

[Figure: algorithm design → C code generation → compilation to assembly yields poor code quality; a "software washing machine" is applied to the generated C code]

– better performance
– lower code size
– highly reusable

CoWare SPW example

[Figure: block diagram from source to sink with gain blocks K1 to K5, two delay elements z-1 and three adders ∑]

    (*buffer_45) = inst__9_coef * (*buffer_56);
    (*buffer_44) = inst__4_coef * (*buffer_56);
    (*buffer_43) = inst__8_coef * (*buffer_56);
    (*buffer_58) = (*buffer_51) + (*buffer_43);
    (*buffer_46) = inst__0_coef * (*buffer_58);
    (*buffer_59) = (*buffer_45) + (*buffer_46);
    (*buffer_47) = inst__10_coef * (*buffer_58);

    {
        /* RO: spb/dly.symbol / 52 */
        (*buffer_52) = inst__11__past_value;
    }
    {
        /* RI: spb/dly.symbol / 52 */
        inst__11__past_value = (*buffer_59);
    }

    (*buffer_60) = (*buffer_44) + (*buffer_52) + (*buffer_47);
    inst__2__past_value = (*buffer_60);

• Very conservative, block-oriented code generation scheme
• Even simple code transformations have great impact, e.g.
  – Variable localization
  – If-statement merging
  – Optimized data type selection


4. ASIP architecture design


ASIP implementation after exploration

Unified Description Layer

[Figure: LISA → HDL generation → register-transfer level → gate-level synthesis (e.g. SYNOPSYS design compiler) → gate level]

Challenges in Automated ASIP Implementation

ADL: independent description of instruction behavior
• Instructions
  – Arithmetic: Mul, Mac
  – Control: JMP, BRC
+ Efficient design space exploration

1:1 mapping

HDL: independent mapping to hardware blocks
• Multiplier (MUL)
• Multiplier (MAC)
- Insufficient architectural efficiency by 1:1 mapping

Unified Description Layer

[Figure: LISA → unified description layer (structure & mapping incl. JTAG/DEBUG, optimizations, backend for VHDL, Verilog, SystemC) → register-transfer level → gate-level synthesis (e.g. SYNOPSYS design compiler) → gate level]

Optimization strategies

• LISA: separate descriptions for separate instructions
• Goal: share hardware for separate instructions

Possible optimizations
• ALU sharing

[Figure: instruction A (LISA operation A) computes x = a + b, instruction B (LISA operation B) computes y = c + d. Because the two are mutually exclusive, a single adder with inputs selected from a or c and from b or d produces x,y]
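As a behavioral illustration of ALU sharing, the following C sketch models the merged data path from the figure: one adder whose operands are selected by multiplexers, which is only valid because the two operations never execute in the same cycle. It models the resulting behavior, not the HDL that the LISA tools generate, and the names `active_op` and `shared_adder` are hypothetical.

```c
#include <stdint.h>

/* Which of the two mutually exclusive LISA operations is active. */
typedef enum { OP_A, OP_B } active_op;

/* One adder instance serves both operations:
 *   OP_A: x = a + b
 *   OP_B: y = c + d
 * The select signal drives two operand multiplexers in front of the adder. */
uint32_t shared_adder(active_op sel, uint32_t a, uint32_t b,
                      uint32_t c, uint32_t d)
{
    uint32_t lhs = (sel == OP_A) ? a : c;   /* operand multiplexer 1 */
    uint32_t rhs = (sel == OP_A) ? b : d;   /* operand multiplexer 2 */
    return lhs + rhs;                       /* single shared adder   */
}
```

The trade is one adder saved against two multiplexers added, so the benefit grows with the cost of the shared unit.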

Optimization strategies

• LISA: separate descriptions for separate instructions
• Goal: same hardware for separate instructions

Possible optimizations
• ALU sharing
• Path sharing
• ...

[Figure: instruction A (LISA operation A) accesses the register array over path PA (AddressA, DataA), instruction B (LISA operation B) over path PB (AddressB, DataB). Mutual exclusiveness permits resource sharing: AddressA and AddressB share one path into the register array, which returns DataA, DataB]


5. Case study

Motorola 6811

Project goals:

• Performance (MIPS) must be increased
• Compatibility on the assembly level for reuse of legacy code (integration into existing tool flow)
• Royalty free design

Compatible architecture developed with LISA using RTL processor synthesis

Motorola 6811

[Figure: legacy code → compiler → assembly → assembler → binary code for the 6811/6812; open question: which architecture runs it? Goal: increase performance (MIPS)]

Motorola 6811

[Figure: Bluetooth app. → 6811 compiler → assembly → assembler → binary code, executed on the architecture synthesized from LISA, which is assembly level compatible]

Architecture Development

original 6811 Processor          LISA 6811 Processor
 8 bit instructions              16 bit instructions
16 bit instructions              32 bit instructions
24 bit instructions
32 bit instructions
40 bit instructions

Original 6811: instruction is fetched by 8 bit blocks, up to 5 cycles for fetching!

LISA 6811: 16 bit are fetched simultaneously, max 2 cycles for fetching!
+ pipelined architecture
+ possibility for special instructions
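The fetch cycle figures follow from dividing the instruction length by the fetch width and rounding up. A minimal sketch of that arithmetic, with the hypothetical helper name `fetch_cycles`:

```c
#include <assert.h>

/* Number of fetch cycles for one instruction: ceil(instr_bits / fetch_width_bits). */
unsigned fetch_cycles(unsigned instr_bits, unsigned fetch_width_bits)
{
    assert(fetch_width_bits > 0);
    return (instr_bits + fetch_width_bits - 1) / fetch_width_bits;
}
```

For the original 6811, the longest instruction gives `fetch_cycles(40, 8) == 5`; for the LISA model, `fetch_cycles(32, 16) == 2`.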

Tools Flow and RTL Processor Synthesis

[Figure: C-application → 6811 compiler → assembly → LISA assembler → executable; the LISA model generates the LISA tools]

6811 compatible architecture generated completely in VHDL

1) VLSI implementation:
   Area: < 17 kGates
   Clock speed: ~154 MHz
2) Mapped onto XILINX FPGA

Retinex ASIP (digital image enhancement)

[Figure: IN and OUT images; Virtex II FPGA board]

ASIC (0.18 µm):
• 102 kGates, 93 MHz

ASIP (0.13 µm):
• 124 kGates, 154 MHz
• SW programmable!

6. Advanced research topics

Generalized ASIP architecture design flow

Algorithm design (Matlab, SPW, ...) → C code generation or implementation → initial architecture → architecture optimization

[Figure: profiling pie chart with shares 14%, 27%, 1%, 3%, 0%, 1%, 4%, 20%, 1%, 0%, 26%, 3%]


A closer look at profilers: ASM level


ASM level

– requires full architecture description
– slow (~1000x native C)


A closer look at profilers: source level


sample app: image corner detection

gprof

hot spot

Which instructions implement this code efficiently?


Do not neglect compiler optimizations

gcov exec count


p = in + (corner_list[n].y-1)*x_size + corner_list[n].x - 1;

source-level view:
• 5x ADD
• 2x SUB
• 3x MUL
• 2x LOAD
• ...

real code (optimized):
• 5x ADD
• 2x SUB
• 1x MUL
• 2x LOAD
• ...

Wrong ISA decisions may result!
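The effect can be reproduced on a toy model. The Python sketch below hand-writes a hypothetical three-address sequence for the statement above and applies a simple local common subexpression elimination. The names `ELEM_SIZE` and `OFF_Y`, and the assumption that field `x` sits at offset 0, are illustrative and not taken from the original application. This is not the compiler used in the slides, and a real compiler applies further passes (e.g. strength reduction), so the optimized counts produced here need not match the figures above.

```python
from collections import Counter

# Hypothetical three-address code for
#   p = in + (corner_list[n].y - 1) * x_size + corner_list[n].x - 1;
# Each tuple is (dest, op, arg1, arg2). ELEM_SIZE and OFF_Y are symbolic
# assumptions; field x is assumed to sit at offset 0 of the element.
NAIVE = [
    ("t1", "MUL", "n", "ELEM_SIZE"),
    ("t2", "ADD", "corner_list", "t1"),
    ("t3", "ADD", "t2", "OFF_Y"),
    ("t4", "LOAD", "t3", None),
    ("t5", "SUB", "t4", "1"),
    ("t6", "MUL", "t5", "x_size"),
    ("t7", "ADD", "in", "t6"),
    ("t8", "MUL", "n", "ELEM_SIZE"),
    ("t9", "ADD", "corner_list", "t8"),
    ("t10", "LOAD", "t9", None),
    ("t11", "ADD", "t7", "t10"),
    ("p", "SUB", "t11", "1"),
]


def eliminate_common_subexpressions(code):
    """Local CSE for single-assignment code without stores.

    The first temporary that computes (op, a, b) is reused by every later
    instruction that would compute the same value.
    """
    seen = {}   # (op, a, b) -> temporary that already holds the value
    alias = {}  # eliminated temporary -> surviving temporary
    out = []
    for dest, op, a, b in code:
        a = alias.get(a, a)
        b = alias.get(b, b)
        key = (op, a, b)
        if key in seen:
            alias[dest] = seen[key]
        else:
            seen[key] = dest
            out.append((dest, op, a, b))
    return out


def count_ops(code):
    """Count how often each operator occurs in the instruction list."""
    return Counter(op for _, op, _, _ in code)


optimized = eliminate_common_subexpressions(NAIVE)
print("naive:    ", dict(count_ops(NAIVE)))
print("optimized:", dict(count_ops(optimized)))
```

Even this single pass changes the MUL count, which is the kind of difference that can mislead an ISA decision based on source-level counts alone.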


µ-profiling approach

generate low-level C code

perform compiler optimizations

instrument code

compile & execute

count ADD

count MUL

count LOAD
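The instrumentation step can be illustrated with a small Python model. This is only a sketch of the idea: the tuple-based instruction format, the `COUNT` pseudo-instruction, and the functions `instrument`, `run`, and `profile` are hypothetical and do not describe the actual µ-profiler implementation, which works on low-level C code.

```python
from collections import Counter
import operator

ALU = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


def instrument(code):
    """Insert a COUNT pseudo-instruction in front of every operation."""
    out = []
    for dest, op, a, b in code:
        out.append((None, "COUNT", op, None))
        out.append((dest, op, a, b))
    return out


def run(code, env, memory):
    """Execute straight-line code; return the new environment and counts."""
    env = dict(env)
    counts = Counter()
    for dest, op, a, b in code:
        if op == "COUNT":
            counts[a] += 1              # a holds the name of the counted op
        elif op == "LOAD":
            env[dest] = memory[env[a]]  # a names the address variable
        else:
            env[dest] = ALU[op](env[a], env[b])
    return env, counts


# Loop body of: acc += mem[base + i] * mem[base + i]
BODY = [
    ("addr", "ADD", "base", "i"),
    ("x", "LOAD", "addr", None),
    ("sq", "MUL", "x", "x"),
    ("acc", "ADD", "acc", "sq"),
]


def profile(body, memory, iterations):
    """Run the instrumented body once per iteration and sum the counters."""
    env = {"base": 0, "acc": 0}
    totals = Counter()
    instrumented = instrument(body)
    for i in range(iterations):
        env["i"] = i
        env, counts = run(instrumented, env, memory)
        totals += counts
    return env["acc"], totals


acc, totals = profile(BODY, [1, 2, 3, 4], 4)
print(acc, dict(totals))
```

Because the counters are inserted after any optimization of the low-level code, the reported counts reflect the operations that are actually executed.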


Profiler features summary

| | C source level (e.g. gprof) | Assembly level (e.g. LISATek) | Micro-profiler |
|---|---|---|---|
| Primary application | Source code optimization | ISA and architecture optimization | ISA and architecture optimization |
| Needs architectural details | No | Yes | No |
| Speed | High | Low | Medium |
| Profiling granularity | Coarse | Fine | Fine |


Micro-profiler in the design flow

Precise profiling of:
• Operator execution count
• Data type use
• Variable word lengths
• Memory access patterns

Guide designer in basic architecture decisions:
• Inclusion of dedicated function units (floating point, address gen.)
• Leave out unused instructions
• Optimal data path word lengths
• Design of memory hierarchy (cache, scratchpad)

Export profiling data for automatic ISA synthesis tool

Effects of compiler optimizations predicted early
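How word-length profiling can feed a data path decision is sketched below in Python. The class `WordLengthProfile` and its methods are hypothetical names chosen for illustration. The sketch assumes signed two's complement values and simply tracks the observed value range per variable; it does not describe the micro-profiler's actual data format.

```python
def signed_bits(value):
    """Bits needed to hold value in two's complement (sign bit included)."""
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() + 1


class WordLengthProfile:
    """Tracks the observed value range of each profiled variable."""

    def __init__(self):
        self.ranges = {}  # variable name -> (lowest, highest) observed value

    def record(self, name, value):
        lo, hi = self.ranges.get(name, (value, value))
        self.ranges[name] = (min(lo, value), max(hi, value))

    def required_bits(self, name):
        lo, hi = self.ranges[name]
        return max(signed_bits(lo), signed_bits(hi))


profile = WordLengthProfile()
for v in (0, 100, -3):
    profile.record("acc", v)
profile.record("idx", 300)

for name in sorted(profile.ranges):
    print(name, profile.required_bits(name))
```

A designer could compare such per-variable widths against a candidate data path width, keeping in mind that the result only covers the input data that was actually profiled.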


Custom instruction set synthesis: recent approaches

J. Yang, C. Kyung et al.: MetaCore: An Application Specific DSP Development System, DAC 1998
M. Gschwind: Instruction Set Selection for ASIP Design, CODES 1999
K. Kücükcakar: An ASIP Design Methodology for Embedded Systems, CODES 1999
H. Choi, C. Kyung et al.: Synthesis of Application Specific Instructions for Embedded DSP Software, IEEE Trans. CAD 1999
F. Sun, S. Ravi et al.: Synthesis of Custom Processors based on Extensible Platforms, ICCAD 2002
D. Goodwin, D. Petkov: Automatic Generation of Application Specific Processors, CASES 2003
K. Atasu, L. Pozzi, P. Ienne: Automatic Application-Specific Instruction-Set Extensions under Microarchitectural Constraints, DAC 2003
N. Clark, H. Zhong, S. Mahlke: Processor Acceleration through Automated Instruction Set Customization, MICRO 2003
... and many others


Atasu/Pozzi/Ienne (EPFL)

• Branch-and-bound-like algorithm for custom pattern identification under I/O constraints

• Capable of identifying optimal complex and disconnected patterns

• Requires fast and accurate cost estimation of speedup due to candidate patterns


Custom ISA synthesis: ISS approach

Application code profiling (gprof, µprof)

Hot spot(s) graph model

Semi-automatic ISA synthesis (GUI)

Performance estimation

Processor generation (e.g. CorXpert)

Performance evaluation

optimal custom instr

Architectural constraints

User preferences


Abstract modeling: optimal data flow graph covering

[Figure: data flow graph of the HOT SPOT with operator nodes (-, <<, >>, +, *, /), covered by three custom instructions: instr 1, instr 2, instr 3]

• Precise mathematical problem formulation

• Maximize application speedup under given architecture and area constraints

• Solve optimization problem with combination of integer linear programming and heuristics

• Using CoWare CorXpert as backend
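The covering problem can be made concrete with a toy instance. The Python sketch below is not the ILP-plus-heuristics method named above: it uses exhaustive search, which is feasible only for a handful of candidates. The candidate patterns, node names, cycle savings, and area costs are all hypothetical.

```python
from itertools import combinations

# Hypothetical candidates:
#   name -> (covered DFG nodes, cycles saved, area cost)
CANDIDATES = {
    "instr1": (frozenset({"sub1", "shl", "shr"}), 2, 3),
    "instr2": (frozenset({"add1", "sub2", "mul1"}), 3, 5),
    "instr3": (frozenset({"mul2", "div", "add2"}), 4, 6),
    "instr4": (frozenset({"shr", "add1"}), 1, 1),
}


def best_cover(candidates, area_budget):
    """Return (saved cycles, chosen names) maximizing the saved cycles.

    Chosen patterns must not share DFG nodes and their total area
    must stay within area_budget.
    """
    best = (0, ())
    names = sorted(candidates)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            node_sets = [candidates[n][0] for n in subset]
            covered = frozenset().union(*node_sets)
            if len(covered) != sum(len(s) for s in node_sets):
                continue  # two patterns overlap on some node
            area = sum(candidates[n][2] for n in subset)
            if area > area_budget:
                continue  # violates the area constraint
            saved = sum(candidates[n][1] for n in subset)
            if saved > best[0]:
                best = (saved, subset)
    return best


print(best_cover(CANDIDATES, area_budget=10))
print(best_cover(CANDIDATES, area_budget=14))
```

An ILP formulation would express the same selection with one binary variable per candidate, a non-overlap constraint per DFG node, and a single area constraint.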



Outlook: reconfigurable ASIPs

Currently: (one-time) configurable ASIPs

Combination of ASIP and FPGA: “field-configurable ASIPs”

Fast adaptations w/o HW modifications

reconfigurable

Base processor architecture

Application specific components

Embedded FPGA


Outlook: compilation for heterogeneous MPSoCs

Urgently needed, but virtually not present today
No tools even for simplest platforms (e.g. TI OMAP)
Need for automated spatial and temporal task-to-processor mapping
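What "spatial and temporal mapping" means can be shown on a toy example. The Python sketch below uses a simple greedy earliest-finish heuristic, chosen here only for illustration. It assumes independent tasks (no data dependencies), and the processor names and cycle costs are hypothetical.

```python
def greedy_map(tasks, costs, processors):
    """Map each task to the processor on which it finishes earliest.

    tasks: task names in the order they are considered
    costs: costs[task][pe] = execution cycles; a missing pe means the
           task cannot run on that processor
    Returns {task: (processor, start cycle, finish cycle)}.
    """
    free_at = {pe: 0 for pe in processors}
    schedule = {}
    for task in tasks:
        options = [
            (free_at[pe] + costs[task][pe], pe)
            for pe in processors
            if pe in costs[task]
        ]
        finish, pe = min(options)
        schedule[task] = (pe, free_at[pe], finish)  # spatial + temporal
        free_at[pe] = finish
    return schedule


COSTS = {
    "a": {"CPU": 4, "ASIP": 2},
    "b": {"CPU": 3, "ASIP": 3},
    "c": {"CPU": 5},  # assumed to run on the CPU only
}

for task, slot in greedy_map(["a", "b", "c"], COSTS, ["CPU", "ASIP"]).items():
    print(task, slot)
```

A practical tool would also have to handle task dependencies, communication cost, and memory placement, none of which this sketch models.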

[Figure: tasks a to g (source files a.c to g.c) mapped onto an MPSoC platform consisting of ASIC, CPU, and ASIP processing elements with memories]


References

R. Leupers: Code Optimization Techniques for Embedded Processors - Methods, Algorithms, and Tools, Kluwer, 2000
R. Leupers, P. Marwedel: Retargetable Compiler Technology for Embedded Systems - Tools and Applications, Kluwer, 2001
A. Hoffmann, H. Meyr, R. Leupers: Architecture Exploration for Embedded Processors with LISA, Kluwer, 2002
C. Rowen, S. Leibson: Engineering the Complex SoC: Fast, Flexible Design with Configurable Processors, Prentice Hall, 2004
M. Gries, K. Keutzer, et al.: Building ASIPs: The Mescal Methodology, Springer, 2005
P. Ienne, R. Leupers (eds.): Customizable and Configurable Embedded Processor Cores, Morgan Kaufmann, to appear 2006


Institute for Integrated Signal Processing Systems

Thank you!