Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Page 1: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Frank Vahid
Associate Professor
Dept. of Computer Science and Engineering
University of California, Riverside

Also with the Center for Embedded Computer Systems at UC Irvine

http://www.cs.ucr.edu/~vahid

This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend

Page 2: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

General Purpose vs. Special Purpose

Standard tradeoff

Oct. 14, 2002, Cincinnati, Ohio -- physicians at Cincinnati Children’s Hospital Medical Center report duct tape effective at treating warts.

Amazing to think this came from wolves

Page 3: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

General Purpose vs. Single Purpose Processors

Designers have long known that:

General-purpose processors are flexible

Single-purpose processors are fast

ENIAC, 1940s: its flexibility was the big deal

[Figure: a general-purpose processor (controller with control logic, state register, IR, and PC; datapath with register file and general ALU; program memory holding the assembly code; data memory) beside a single-purpose processor (controller with control logic and state register; datapath with i, total, and an adder; data memory). Both implement:]

total = 0
for i = 1 to N loop
  total += M[i]
end loop

• General purpose: flexibility, design cost, time-to-market
• Single purpose: performance, power efficiency, size

Page 4: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Mixing General and Single Purpose Processors

A.k.a. Hardware/software partitioning

Hardware: single-purpose processors

coprocessor, accelerator, peripheral, etc.

Software: general-purpose processors

Though hardware underneath!

Especially important for embedded systems

Computers embedded in devices (cameras, cars, toys, even people)

Speed, cost, time-to-market, power, size, … demands are tough

[Figure: digital camera chip. A lens and CCD feed a chip containing a microcontroller, CCD preprocessor, pixel coprocessor, A2D, D2A, JPEG codec, DMA controller, memory controller, ISA bus interface, UART, LCD control, display control, and multiplier/accumulator.]

[Figure: sw only, hw only, and hw/sw implementations compared on flexibility, speed, and energy.]

Page 5: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

How is Partitioning Done for Embedded Systems?

• Partitioning into hw and sw blocks done early
  • During conceptual stage
• Sw design done separately from hw design
• Attempts since late 1980s to automate not yet successful
  • Partitioning manually is reasonably straightforward
  • Spec is informal and not machine readable
  • Sw algorithms may differ from hw algorithms
  • No compelling need for tools

[Figure: design flow. An informal spec goes through system partitioning into a sw spec and a hw spec; sw design targets the processor and hw design targets the ASIC.]

Page 6: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

New Platforms Invite New Efforts in Hw/Sw Partitioning

• New single-chip platforms contain both a general-purpose processor and an FPGA
  • FPGA: field-programmable gate array
  • Programmable just like software
  • Flexible
  • Intended largely to implement single-purpose processors
• Can we perform a later partitioning to improve the software too?

[Figure: design flow from informal spec through system partitioning to sw spec/sw design and hw spec/hw design, targeting a processor + FPGA and an ASIC, with an additional, later partitioning step for the processor + FPGA.]

Page 7: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Commercial Single-Chip Microprocessor/FPGA Platforms

• Triscend E5: based on 8-bit 8051 CISC core (2000)
  • 10 Dhrystone MIPS at 40 MHz
  • Up to 40K logic gates
  • Cost only about $4

[Figure: Triscend E5 chip, showing configurable logic, 8051 processor plus other peripherals, and memory.]

Page 8: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Single-Chip Microprocessor/FPGA Platforms

• Atmel FPSLIC
  • Field-Programmable System-Level IC
  • Based on AVR 8-bit RISC core
  • 20 Dhrystone MIPS
  • 5k-40k logic gates
  • $5-$10

Courtesy of Atmel

Page 9: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Single-Chip Microprocessor/FPGA Platforms

• Triscend A7 chip (2001)
  • Based on ARM7 32-bit RISC processor
  • 54 Dhrystone MIPS at 60 MHz
  • Up to 40k logic gates
  • $10-$20 in volume

Courtesy of Triscend

Page 10: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Single-Chip Microprocessor/FPGA Platforms

• Altera’s Excalibur EPXA 10 (2002)
  • ARM (922T) hard core
  • 200 Dhrystone MIPS at 200 MHz
  • ~200k to ~2 million logic gates

Source: www.altera.com

Page 11: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Single-Chip Microprocessor/FPGA Platforms

• Xilinx Virtex II Pro (2002)
  • PowerPC based
  • 420 Dhrystone MIPS at 300 MHz
  • 1 to 4 PowerPCs
  • 4 to 16 gigabit transceivers
  • 12 to 216 multipliers
  • Millions of logic gates
  • 200k to 4M bits RAM
  • 204 to 852 I/O
  • $100-$500 (>25,000 units)

[Figure: Virtex II Pro chip, showing configurable logic, PowerPCs, and up to 16 serial transceivers (622 Mbps to 3.125 Gbps).]

Courtesy of Xilinx

Page 12: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Single-Chip Microprocessor/FPGA Platforms

Why wouldn’t future microprocessor chips include some amount of on-chip FPGA?

• One argument against: area
  • Lots of silicon area taken up by FPGA
  • FPGA about 20-30 times less area efficient than custom logic
• FPGA used to be for prototyping, too big for final products
• But chip trends imply that FPGAs will be O.K. in final products…

Page 13: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

How Much is Enough?

Perhaps a bit small

Page 14: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

How Much is Enough?

Reasonably sized

Page 15: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

How Much is Enough?

Probably plenty big for most of us

Page 16: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

How Much is Enough?

More than typically necessary

Page 17: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

How Much Custom Logic is Enough?

1993: ~1 million logic transistors

Perhaps a bit small

• 8-bit processor: 50,000 transistors
• Pentium: 3 million transistors
• MPEG decoder: several million transistors

[Figure: IC package and IC.]

Page 18: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

How Much Custom Logic is Enough?

1996: ~5-8 million logic transistors

Reasonably sized

Page 19: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

How Much Custom Logic is Enough?

1999: ~10-50 million logic transistors

Probably plenty big for most of us

Page 20: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

How Much Custom Logic is Enough?

2002: ~100-200 million logic transistors

More than typically necessary

Page 21: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

How Much Custom Logic is Enough?

2008: >1 BILLION logic transistors

For comparison, 1993: 1 M

Perhaps very few people could design this

Page 22: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Very Few Companies Can Design High-End ICs

• Designer productivity growing at slower rate
  • 1981: 100 designer months, ~$1M
  • 2002: 30,000 designer months, ~$300M

[Figure: design productivity gap, 1981 to 2009. IC capacity (logic transistors per chip, in millions), following Moore’s Law, is plotted against productivity (K transistors per staff-month); the gap between the two curves widens over time. Source: ITRS’99]

Page 23: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Single-Chip Platforms with On-Chip FPGAs

[Figure: chart of cost per IC versus volume.]

[Figure: chip capacity in 1990, 2000, and 2010 compared with mainstream design; the largest chips are becoming out of reach of mainstream designers.]

So, big FPGAs on-chip are O.K., because mainstream designers couldn’t have used all that silicon area anyways

But, couldn’t designers use custom logic instead of FPGAs to make smaller chips and save costs?

Page 24: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Shrinking Chips

• Yes, but there’s a limit
  • Chips becoming pin limited
• A football huddle can only get so small

[Figure: a chip shrinking inside its ring of pads connecting to external pins. This area will exist whether we use it all or not.]

Page 25: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Trend Towards Pre-Fabricated Platforms: ASSPs

• ASSP: application specific standard product
  • Domain-specific pre-fabricated IC
  • e.g., digital camera IC
• ASIC: application specific IC
• ASSP revenue > ASIC
• ASSP design starts > ASIC
  • Unique IC design
  • Ignores quantity of same IC
• ASIC design starts decreasing
  • Due to strong benefits of using pre-fabricated devices

Source: Gartner/Dataquest, September ’01

Page 26: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Microprocessor/FPGA Platforms

• Trends point towards such platforms increasing in popularity
• Can we automatically partition the software to utilize the FPGA?
  • For improved speed and energy

Page 27: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Automatic Hardware/Software Partitioning

• Since late 1980s, goal has been spec in, hw/sw out
• But no successful commercial tool yet. Why?

// From MediaBench’s JPEG codec
GLOBAL(void)
jpeg_fdct_ifast (DCTELEM * data)
{
  DCTELEM tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
  DCTELEM tmp10, tmp11, tmp12, tmp13;
  DCTELEM z1, z2, z3, z4, z5, z11, z13;
  DCTELEM *dataptr;
  int ctr;
  SHIFT_TEMPS

  /* Pass 1: process rows. */

  dataptr = data;
  for (ctr = DCTSIZE-1; ctr >= 0; ctr--) {
    tmp0 = dataptr[0] + dataptr[7];
    tmp7 = dataptr[0] - dataptr[7];
    tmp1 = dataptr[1] + dataptr[6];
    // Thousands of lines like this in dozens of files

[Figure: the ideal flow. The software “spec” enters a partitioner, which produces software (compiled for the processor) and hardware (synthesized for the ASIC/FPGA).]

Page 28: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Why No Successful Tool Yet?

• Most research has focused on extensive exploration
  • Roots in VLSI CAD
  • Decompose problem into fine-grained operations
  • Apply sophisticated partitioning algorithms
  • Examples: min-cut, dynamic programming, simulated annealing, tabu-search, genetic evolution, etc.
• Is this overkill?

[Figure: the “spec” decomposed into 1000s of nodes (like circuit partitioning) and fed to the partitioner.]

Page 29: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

We Really Only Need Consider a Few Loops, Due to the 90-10 Rule

• Recent appearance of embedded benchmark suites
  • Enables analysis and understanding of the real problem
  • We’ve examined UCLA’s MediaBench, Netbench, Motorola’s Powerstone
  • Currently examining EEMBC (embedded equivalent of SPEC)
• UCR loop analysis tools based on SimpleScalar and Simics
• Assigned each loop a number, sorted by fraction of contribution to total execution time

[Figure: fraction of total execution time (0 to 1) for loops 1 through 10, shown beside the MediaBench JPEG codec excerpt (jpeg_fdct_ifast).]

Page 30: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

The 90-10 Rule Holds for Embedded Systems

[Figure: per-benchmark charts for loops 1 through 10, plus a summary chart of % execution time and % size of program (0% to 100%) versus loops 1 through 10.]

In fact, the most frequent loop alone took 50% of time, using 1% of code

Page 31: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

So Need We Only Consider the First Few Loops? Not Necessarily

• What if programs were self-similar w.r.t. the 90-10 rule?
  • Remove most frequent loop: does the 90-10 rule still hold?
  • Intuition might say yes: remove loop, and we have another program.

[Figure: two pairs of charts over loops 1 through 10. Each pair shows % remaining execution time (0 to 1) and speedup (0 to 1000).]

• So we need only speed up the first few loops
  • After that, speedups are limited
  • Good from tool perspective!
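
The self-similarity question can be checked numerically. The sketch below is only an illustration: the loop fractions in `profile` are made up, not taken from the benchmarks, and the function names are hypothetical. For each loop it computes the share of the execution time that remains once all more frequent loops are removed, and the upper bound on speedup if the first k loops took zero time in hardware. A self-similar program would show a roughly constant share; a falling share means later loops matter less and less.

```python
def remaining_shares(fractions):
    """Share of the still-remaining execution time taken by each loop,
    after all more frequent loops have been removed."""
    shares = []
    remaining = 1.0
    for f in fractions:
        shares.append(f / remaining)
        remaining -= f
    return shares


def ideal_speedup(fractions, k):
    """Upper bound on overall speedup if the first k loops took zero time."""
    return 1.0 / (1.0 - sum(fractions[:k]))


# Hypothetical profile: fraction of total time per loop, most frequent first.
profile = [0.50, 0.20, 0.10, 0.05, 0.03]

for k, share in enumerate(remaining_shares(profile), start=1):
    print(f"loop {k}: {share:.0%} of remaining time, "
          f"ideal speedup with loops 1..{k} in hw: {ideal_speedup(profile, k):.2f}")
```

With a profile like this one, the bound grows quickly for the first loops and then flattens, which is the behavior the slide describes.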

Page 32: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Manually Partitioned Several PowerStone Benchmarks onto Triscend A7 and E5 Chips

• Used multimeter and timer to measure performance and power
• Obtained good speedups and energy savings by partitioning software among microprocessor and on-chip FPGA

A7 results

Benchmark  Time_orig  Time_sw/hw  Speedup  P_orig  P_sw/hw  E_orig  E_sw/hw  E savings
PS_g3fax       11.47        7.44      1.5   1.320    1.332  15.140    9.910        35%
PS_crc         10.92        4.51      2.4   1.320    1.320  14.414    5.953        59%
PS_brev         9.84        3.28      3.0   1.332    1.344  13.107    4.408        66%
Average speedup: 2.3. Average energy savings: 53%

E5 results

Benchmark  Time_orig  Time_sw/hw  Speedup  P_orig  P_sw/hw  E_orig  E_sw/hw  E savings
PS_g3fax       15.16        7.11      2.1   0.252    0.270   3.820    1.920        50%
PS_crc         10.64        4.64      2.3   0.207    0.225   2.202    1.044        53%
PS_brev        17.81        1.81      9.8   0.252    0.270   4.488    0.489        89%
Average speedup: 4.8. Average energy savings: 64%

[Figures: E5 IC; Triscend A7 development board.]
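
The derived columns in these tables follow from the measured ones: speedup is the ratio of the two times, and energy is power multiplied by time. The Python sketch below recomputes them for the PS_g3fax row of the A7 table. The function name and dictionary keys are hypothetical, and the units are whatever the slide used, which it does not state.

```python
def partition_metrics(time_orig, time_part, power_orig, power_part):
    """Speedup and energy savings from measured time and power.

    Energy is taken as power * time for each version of the program.
    """
    energy_orig = power_orig * time_orig
    energy_part = power_part * time_part
    return {
        "speedup": time_orig / time_part,
        "energy_orig": energy_orig,
        "energy_part": energy_part,
        "energy_savings": 1.0 - energy_part / energy_orig,
    }


# PS_g3fax on the A7: Time_orig, Time_sw/hw, P_orig, P_sw/hw from the table.
m = partition_metrics(11.47, 7.44, 1.320, 1.332)
print(f"speedup {m['speedup']:.1f}, "
      f"energy {m['energy_orig']:.3f} -> {m['energy_part']:.3f}, "
      f"savings {m['energy_savings']:.0%}")
```

Note that power rises slightly when the FPGA is used, yet energy still drops because the run time falls much more.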

Page 33: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Simulation-Based Results for More Benchmarks

Example   Archit   Cycles_orig     Cycles_sw  Cycles_hw Loop Sp. Clk_hw Total Sp.  P_sw   P_hw  E_orig  E_sw/hw  E Sav
PS_g3fax  8051      19,675,456    10,812,544    176,562       61     25       2.2  0.05  0.032  0.1142  0.05408    53%
PS_crc    8051         291,196       180,224      7,168       25     25       2.5  0.05  0.028  0.0017  0.00071    58%
PS_summin 8051     109,821,892    20,394,080    384,416       53     25       1.2  0.05  0.033  0.6376  0.53657    16%
PS_brev   8051         330,064       305,768      1,360      225     25      12.9  0.05  0.034  0.0019  0.00015    92%
PS_matmul 8051         119,420       101,576      2,560       40     25       5.9  0.05  0.035  0.0007  0.00012    82%
PS_g3fax  MIPS      15,600,000     4,720,000    599,000        8    100       1.4  0.07  0.111  0.0265  0.02163    18%
PS_adpcm  MIPS         113,000        29,300      5,440        5    100       1.3  0.07  0.181  0.0002  0.00018     6%
PS_crc    MIPS       5,040,000     3,480,000    460,800        8    100       2.5  0.07  0.061  0.0086  0.00379    56%
PS_des    MIPS         142,000        70,700     15,100        5    100       1.6  0.07  0.197  0.0002  0.00019    20%
PS_engine MIPS         915,000       145,000     28,100        5    100       1.1  0.07  0.082  0.0016  0.00146     6%
PS_jpeg   MIPS       7,900,000       646,000    171,000        4    100       1.1  0.07  0.092  0.0134  0.01360    -1%
PS_summin MIPS       2,920,000     1,270,000    266,000        5    100       1.5  0.07  0.111  0.0050  0.00375    24%
PS_v42    MIPS       3,850,000       846,000    216,000        4    100       1.2  0.07  0.102  0.0065  0.00605     7%
PS_brev   MIPS           3,566         2,499        138       18    100       3.0  0.07  0.107  0.0000  0.00000    62%
MB_g721   MIPS     838,230,002   457,674,179  9,985,261       46    100       2.1  0.07  0.152  1.4250  0.75035    47%
MB_adpcm  MIPS      32,894,094    32,866,110  1,183,260       28     42      11.6  0.07  0.130  0.0559  0.00821    85%
MB_pegwit MIPS      42,752,919    33,276,287  2,167,651       15     50       3.1  0.07  0.170  0.0727  0.03241    55%
NB_dh     MIPS   1,793,032,157 1,349,063,192 45,156,767       30     69       3.5  0.07  0.121  3.0482  1.00547    67%
NB_md5    MIPS       5,374,034     3,046,881    289,877       11     47       1.8  0.07  0.251  0.0091  0.00722    21%
NB_tl     MIPS      57,412,470    29,244,221  2,479,552       12     58       1.8  0.07  0.059  0.0976  0.05930    39%
Average loop speedup: 30. Average total speedup: 3.2. Average energy savings: 34%

• Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg)
• (Quicker than physical implementation, results matched reasonably well)
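
The Total Sp. column appears to follow from the cycle counts: the loop's software cycles are removed from the original total and replaced by its hardware cycles, scaled when the hardware clock is slower than the processor clock. This relationship is inferred from the numbers and is not stated on the slide. The processor clocks of 25 MHz (8051) and 100 MHz (MIPS) used below are assumptions, and the function name is hypothetical. A minimal Python sketch:

```python
def total_speedup(cycles_orig, cycles_loop_sw, cycles_loop_hw, cpu_mhz, hw_mhz):
    """Overall speedup when one loop moves from software to hardware.

    Hardware cycles are converted to equivalent processor cycles so that a
    slower hardware clock (hw_mhz < cpu_mhz) costs proportionally more time.
    """
    hw_equiv_cycles = cycles_loop_hw * cpu_mhz / hw_mhz
    new_cycles = cycles_orig - cycles_loop_sw + hw_equiv_cycles
    return cycles_orig / new_cycles


# PS_g3fax on the 8051 (assumed 25 MHz processor clock, Clk_hw = 25)
g3fax = total_speedup(19_675_456, 10_812_544, 176_562, cpu_mhz=25, hw_mhz=25)

# MB_adpcm on MIPS (assumed 100 MHz processor clock, Clk_hw = 42)
adpcm = total_speedup(32_894_094, 32_866_110, 1_183_260, cpu_mhz=100, hw_mhz=42)

print(f"PS_g3fax (8051): {g3fax:.1f}")
print(f"MB_adpcm (MIPS): {adpcm:.1f}")
```

The MB_adpcm row shows why the clock scaling matters: its loop speedup in cycles is 28, but the hardware runs at less than half the assumed processor clock, so the total speedup is lower than the cycle ratio alone would suggest.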

Page 34: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Looking at Multiple Loops per Benchmark

• Manually created several partitioned versions of each benchmark
• Most speedup gained with first 20,000 gates
  • Surprisingly few gates!

• Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM) 2002
• Stitt and Vahid, IEEE Design and Test, Dec. 2002
• J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear).

[Figure: speedup (1.0 to 5.0) versus gates (0 to 25,000) for G721 (MB), ADPCM (MB), PEGWIT (MB), DH (NB), MD5 (NB), TL (NB), and URL (NB). Off-scale annotations: 27.2; 2.05 at 90,000.]

Page 35: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Ideal Speedups for Different Architectures

[Figure: per-benchmark charts of speedup (1 to 10) versus loops (0 to 10).]

• Varied loop speedup ratio (sw time / hw time of loop itself) to see impact of faster microprocessor or slower FPGA: 30, 20, 10 (base case), 5 and 2
• Loop speedups of 5 or more work fine for first few loops, not hard to achieve
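
The effect of the loop speedup ratio can be sketched with Amdahl's law: the loops moved to hardware run `loop_speedup` times faster, and everything else is unchanged. The loop fractions below are hypothetical, chosen only to illustrate the shape of the curves, and the function name is not from the talk.

```python
def overall_speedup(loop_fractions, loop_speedup):
    """Amdahl's law: overall speedup when the given loops (fractions of
    total execution time) each run loop_speedup times faster."""
    moved = sum(loop_fractions)
    return 1.0 / ((1.0 - moved) + moved / loop_speedup)


# Hypothetical fractions of execution time for the three most frequent loops.
profile = [0.50, 0.20, 0.10]

for ratio in (30, 20, 10, 5, 2):  # loop speedup ratios from the slide
    speedups = [overall_speedup(profile[:k], ratio)
                for k in range(1, len(profile) + 1)]
    print(ratio, [f"{s:.2f}" for s in speedups])
```

Under this model the overall speedup is dominated by the time left in software, so raising the loop speedup ratio from 10 to 30 changes the result far less than dropping it from 5 to 2.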

Page 36: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Ideal Energy Savings for Different Architectures

• Varied loop power ratio (FPGA power / microprocessor power) to account for different architectures: 2.5, 2.0, 1.5 (base case), 1.0
• Energy savings quite resilient to variations

[Figure: per-benchmark charts of energy savings (0 to 1) versus loops (0 to 10).]
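
One way to see why the savings are resilient is a simple energy model. This is a sketch under stated assumptions, not the model used in the talk: energy is power multiplied by time, the loop runs `loop_speedup` times faster in hardware, the chip draws `power_ratio` times the microprocessor power while the loop runs in hardware, and the example values of `loop_fraction` and `loop_speedup` are made up.

```python
def energy_savings(loop_fraction, loop_speedup, power_ratio):
    """Fraction of energy saved by moving one loop to the FPGA.

    Energies are normalized to the original all-software run (= 1.0).
    """
    sw_energy = 1.0 - loop_fraction                      # code left in software
    hw_energy = power_ratio * loop_fraction / loop_speedup
    return 1.0 - (sw_energy + hw_energy)


# Hypothetical loop: 50% of execution time, 10x faster in hardware.
for ratio in (2.5, 2.0, 1.5, 1.0):  # power ratios from the slide
    print(ratio, f"{energy_savings(0.5, 10, ratio):.1%}")
```

Because the hardware runs the loop for only a short time, even a large power ratio adds little energy in this model.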

Page 37: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

How is Automated Partitioning Done?

[Figure: the design flow from informal spec through system partitioning to sw spec/sw design and hw spec/hw design, targeting a processor + FPGA and an ASIC, with the later partitioning step.]

Previous data obtained manually

Page 38: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Source-Level Partitioning

• Front-end converts code into intermediate format, such as SUIF (Stanford University Intermediate Format)
• Intermediate format explored for hardware candidates
• Binary is generated from assembling and linking. Hw source is generated and synthesized into netlist

[Figure: source-level flow. SW source → compiler front-end → hw/sw partitioning. Software path: compiler back-end → assembly & object files → assembler & linker → binary → processor. Hardware path: hw source → synthesis → netlists → FPGA.]

Page 39: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Problems with Source-Level Partitioning

• Though technically superior, source-level partitioning
  • Disrupts standard commercial tool flow significantly
    • Requires special compiler (ouch!)
  • Multiple source languages, changing source languages
  • How to deal with library code, assembly code, object code?

[Figure: C source, C++ source, and Java source each need a compiler front-end: a C SUIF compiler, a C++ SUIF compiler, and “?” for Java.]

Page 40: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Binary Partitioning

• Source code is first compiled and linked in order to create a binary.
• Candidate hardware regions (a few small, frequent loops) are decompiled for partitioning
• HDL is generated and synthesized, and binary is updated to use hardware

[Figure: binary-level flow. SW source → compilation → assembly & object files → assembler & linker → binary → hw/sw partitioning. Partitioning produces an updated binary for the processor, and hw source that goes through synthesis to netlists for the FPGA.]

Page 41: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Binary-Level Partitioning Results (ICCAD’02)

• Source-Level
  • Average speedup, 1.5
  • Average energy savings, 27%
  • Average 4,361 gates
• Binary-Level
  • Average speedup, 1.4
  • Average energy savings, 13%
  • Large area overhead, averaging 10,325 gates

Page 42: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Binary Partitioning Could Eventually Lead to Dynamic Hw/Sw Partitioning

• Dynamic software optimization gaining interest
  • e.g., HP’s Dynamo
• What better optimization than moving to FPGA?
• Add component on-chip:
  • Detects most frequent sw loops
  • Decompiles a loop
  • Performs compiler optimizations
  • Synthesizes to a netlist
  • Places and routes the netlist onto (simple) FPGA
  • Updates sw to call FPGA
• Self-improving IC
  • Can be invisible to designer
    • Appears as efficient processor
• HARD! Much future work.

[Figure: chip with processor, I$, D$, Mem, DMA, and config. logic, plus an added profiler and a separate Proc. and Mem.]
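
The first step, detecting the most frequent software loops, can be illustrated in a few lines of Python. This is a sketch of one plausible approach, not the design proposed in the talk: it assumes the profiler sees a trace of taken branches as (branch address, target address) pairs and treats a short backward branch as the end of a small loop. The function name, the `max_span` threshold, and the trace are all hypothetical.

```python
from collections import Counter


def hottest_loops(branch_trace, max_span=64, top=3):
    """Rank candidate loops by how often their backward branch is taken.

    branch_trace: iterable of (branch_addr, target_addr) for taken branches.
    A branch whose target lies at most max_span bytes before it is counted
    as one iteration of a small loop.
    """
    counts = Counter()
    for branch_addr, target_addr in branch_trace:
        span = branch_addr - target_addr
        if 0 <= span <= max_span:  # short backward branch
            counts[(target_addr, branch_addr)] += 1
    return counts.most_common(top)


# Hypothetical trace: two small loops and a few forward branches.
trace = ([(0x120, 0x100)] * 500
         + [(0x340, 0x300)] * 40
         + [(0x200, 0x400)] * 5)

for (start, end), iterations in hottest_loops(trace):
    print(f"loop {start:#x}..{end:#x}: {iterations} iterations")
```

Counting only short backward branches keeps the bookkeeping small, which matters if the profiler itself has to fit on-chip.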

Page 43: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Conclusions

• Hardware/software partitioning can significantly improve software speed and energy
• Single-chip microprocessor/FPGA platforms, increasing in popularity, make such partitioning even more attractive
• Successful commercial tool still on the horizon
  • Binary-level partitioning may help in some cases
  • Source-level can yield massive parallelism (Profs. Najjar/Payne)
  • Future dynamic hw/sw partitioning possible?
• Distinction between sw/hw continually being blurred!
• Many people involved:
  • Greg Stitt, Roman Lysecky, Shawn Nematbakhsh, Dinesh Suresh, Walid Najjar, Jason Villarreal, Tom Payne, several others…
• Support from NSF, Triscend, and soon SRC…
• Exciting new directions!