Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Frank Vahid, Associate Professor

Dept. of Computer Science and Engineering, University of California, Riverside

Also with the Center for Embedded Computer Systems at UC Irvine

http://www.cs.ucr.edu/~vahid

This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend


General Purpose vs. Special Purpose

Standard tradeoff

Oct. 14, 2002, Cincinnati, Ohio -- physicians at Cincinnati Children’s Hospital Medical Center report that duct tape is effective at treating warts.

Amazing to think this came from wolves


General Purpose vs. Single Purpose Processors

Designers have long known that:

General-purpose processors are flexible

Single-purpose processors are fast

ENIAC, 1940’s

Its flexibility was the big deal

[Figure: a general-purpose processor (controller with control logic, state register, IR, and PC; datapath with register file and general ALU; program memory holding the assembly code for the loop; data memory) next to a single-purpose processor (controller with control logic and state register; datapath with registers i and total and an adder; data memory), both implementing:]

total = 0
for i = 1 to N loop
  total += M[i]
end loop

General purpose OR single purpose

General purpose: flexibility, design cost, time-to-market

Single purpose: performance, power efficiency, size


Mixing General and Single Purpose Processors

A.k.a. Hardware/software partitioning

Hardware: single-purpose processors

coprocessor, accelerator, peripheral, etc.

Software: general-purpose processors

Though hardware underneath!

Especially important for embedded systems

Computers embedded in devices (cameras, cars, toys, even people)

Speed, cost, time-to-market, power, size, … demands are tough

[Figure: digital camera chip behind a lens and CCD; blocks: CCD preprocessor, pixel coprocessor, A2D, D2A, JPEG codec, microcontroller, multiplier/accumulator, DMA controller, memory controller, ISA bus interface, UART, LCD control, display control]

[Figure: flexibility, speed, and energy compared for sw only, hw only, and hw/sw]


How is Partitioning Done for Embedded Systems?

Partitioning into hw and sw blocks done early

During conceptual stage

Sw design done separately from hw design

Attempts since late 1980s to automate not yet successful

Partitioning manually is reasonably straightforward

Spec is informal and not machine readable

Sw algorithms may differ from hw algorithms

No compelling need for tools

[Figure: design flow. Informal spec, then system partitioning into sw spec and hw spec, then sw design (targeting a processor) and hw design (targeting an ASIC)]


New Platforms Invite New Efforts in Hw/Sw Partitioning

New single-chip platforms contain both a general-purpose processor and an FPGA

FPGA: Field-programmable gate array

Programmable just like software

Flexible

Intended largely to implement single-purpose processors

Can we perform a later partitioning to improve the software too?

[Figure: the same design flow, now targeting a processor + FPGA platform, with a later partitioning step that maps software onto the processor + FPGA]


Commercial Single-Chip Microprocessor/FPGA Platforms

[Figure: Triscend E5 chip, showing configurable logic, an 8051 processor plus other peripherals, and memory]

Triscend E5: based on 8-bit 8051 CISC core (2000)

10 Dhrystone MIPS at 40 MHz

Up to 40K logic gates

Cost only about $4


Single-Chip Microprocessor/FPGA Platforms

Atmel FPSLIC

Field-Programmable System-Level IC

Based on AVR 8-bit RISC core

20 Dhrystone MIPS

5k-40k logic gates

$5-$10

Courtesy of Atmel



Triscend A7 chip (2001)

Based on ARM7 32-bit RISC processor

54 Dhrystone MIPS at 60 MHz

Up to 40k logic gates

$10-$20 in volume

Courtesy of Triscend



Altera’s Excalibur EPXA 10 (2002)

ARM (922T) hard core

200 Dhrystone MIPS at 200 MHz

~200k to ~2 million logic gates

Source: www.altera.com



Xilinx Virtex II Pro (2002)

PowerPC based

420 Dhrystone MIPS at 300 MHz

1 to 4 PowerPCs

4 to 16 gigabit transceivers

12 to 216 multipliers

Millions of logic gates

200k to 4M bits RAM

204 to 852 I/O

$100-$500 (>25,000 units)

[Figure: Virtex II Pro chip, showing configurable logic, PowerPCs, and up to 16 serial transceivers at 622 Mbps to 3.125 Gbps]

Courtesy of Xilinx


Why wouldn’t future microprocessor chips include some amount of on-chip FPGA?


One argument against – area

Lots of silicon area taken up by FPGA

FPGA about 20-30 times less area efficient than custom logic

FPGA used to be for prototyping, too big for final products

But chip trends imply that FPGAs will be O.K. in final products…


How Much is Enough?

Perhaps a bit small


How Much is Enough?

Reasonably sized


How Much is Enough?

Probably plenty big for most of us


How Much is Enough?

More than typically necessary


How Much Custom Logic is Enough?

1993: ~1 million logic transistors

[Figure: IC package and IC]

Perhaps a bit small

8-bit processor: 50,000 tr.

Pentium: 3 million tr.

MPEG decoder: several million tr.


1996: ~ 5-8 million logic transistors

Reasonably sized




Probably plenty big for most of us




More than typically necessary



2008: >1 BILLION logic transistors

[Figure label: 1993: 1 M]

Perhaps very few people could design this



Very Few Companies Can Design High-End ICs

Designer productivity growing at a slower rate than IC capacity

1981: 100 designer months, ~$1M

2002: 30,000 designer months, ~$300M

[Figure: design productivity gap, 1981 to 2009. IC capacity (logic transistors per chip, in millions; Moore’s Law) plotted against productivity (K transistors per staff-month), with the gap between the two curves marked. Source: ITRS’99]


Single-Chip Platforms with On-Chip FPGAs

[Figure: cost per IC versus volume, with curves for 1990, 2000, and 2010; labels: mainstream design; becoming out of reach of mainstream designers]

So, big FPGAs on-chip are O.K., because mainstream designers couldn’t have used all that silicon area anyway

But, couldn’t designers use custom logic instead of FPGAs to make smaller chips and save costs?


Shrinking Chips

Yes, but there’s a limit

Chips becoming pin limited

Pads connecting to external pins

A football huddle can only get so small

This area will exist whether we use it all or not

Shrink


Trend Towards Pre-Fabricated Platforms: ASSPs

ASSP: application specific standard product

Domain-specific pre-fabricated IC

e.g., digital camera IC

ASIC: application specific IC

ASSP revenue > ASIC

ASSP design starts > ASIC

Unique IC design

Ignores quantity of same IC

ASIC design starts decreasing

Due to strong benefits of using pre-fabricated devices

Source: Gartner/Dataquest, September ’01


Microprocessor/FPGA Platforms

Trends point towards such platforms increasing in popularity

Can we automatically partition the software to utilize the FPGA?

For improved speed and energy


Automatic Hardware/Software Partitioning

Since late 1980s – goal has been spec in, hw/sw out

But no successful commercial tool yet. Why?

// From MediaBench’s JPEG codec
GLOBAL(void)
jpeg_fdct_ifast (DCTELEM * data)
{
  DCTELEM tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
  DCTELEM tmp10, tmp11, tmp12, tmp13;
  DCTELEM z1, z2, z3, z4, z5, z11, z13;
  DCTELEM *dataptr;
  int ctr;
  SHIFT_TEMPS

  /* Pass 1: process rows. */
  dataptr = data;
  for (ctr = DCTSIZE-1; ctr >= 0; ctr--) {
    tmp0 = dataptr[0] + dataptr[7];
    tmp7 = dataptr[0] - dataptr[7];
    …
// Thousands of lines like this in dozens of files

[Figure: the ideal flow. A software “spec” enters a partitioner, which emits software (compiled for the processor) and hardware (synthesized for an ASIC/FPGA)]


Why No Successful Tool Yet?

Most research has focused on extensive exploration

Roots in VLSI CAD

Decompose problem into fine-grained operations

Apply sophisticated partitioning algorithms

Examples: min-cut, dynamic programming, simulated annealing, tabu-search, genetic evolution, etc.

Is this overkill?

[Figure: a “spec” decomposed into 1000s of nodes (like circuit partitioning) and fed to a partitioner]


We Really Only Need Consider a Few Loops – Due to the 90-10 Rule

Recent appearance of embedded benchmark suites

Enables analysis and understanding of the real problem

We’ve examined UCLA’s MediaBench, Netbench, Motorola’s Powerstone

Currently examining EEMBC (embedded equivalent of SPEC)

UCR loop analysis tools based on SimpleScalar and Simics

[Figure: fraction of total execution time (0 to 1) for loops 1 to 10, shown beside the jpeg_fdct_ifast excerpt from MediaBench’s JPEG codec]

Assigned each loop a number, sorted by fraction of contribution to total execution time


The 90-10 Rule Holds for Embedded Systems

[Figure: plots over loops 1 to 10 showing % execution time and % size of program]

In fact, the most frequent loop alone took 50% of time, using 1% of code
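The loop ranking behind these plots can be sketched in a few lines. This is only an illustration: the function name `rank_loops` and the profile numbers are hypothetical (chosen so the top loop takes 50% of the time with 1% of the code, as on the slide), and the actual UCR tools based on SimpleScalar and Simics may work differently.

```python
def rank_loops(loop_cycles, loop_sizes, total_cycles, total_size):
    """Sort loops by execution-time share; accumulate time and code-size coverage."""
    order = sorted(loop_cycles, key=loop_cycles.get, reverse=True)
    ranked, time_cum, size_cum = [], 0.0, 0.0
    for rank, name in enumerate(order, start=1):
        time_cum += loop_cycles[name] / total_cycles
        size_cum += loop_sizes[name] / total_size
        ranked.append((rank, name, time_cum, size_cum))
    return ranked


# Hypothetical profile: cycles spent in each loop and loop size in instructions.
profile_cycles = {"loop_b": 200_000, "loop_a": 500_000, "loop_d": 80_000, "loop_c": 120_000}
profile_sizes = {"loop_a": 40, "loop_b": 60, "loop_c": 50, "loop_d": 50}

ranked = rank_loops(profile_cycles, profile_sizes, total_cycles=1_000_000, total_size=4_000)
for rank, name, time_share, size_share in ranked:
    print(f"{rank}: {name}  cumulative time {time_share:.0%}  cumulative size {size_share:.1%}")
```

With a profile shaped like this, a handful of loops cover most of the execution time while occupying a few percent of the code.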


So Need We Only Consider the First Few Loops? Not Necessarily

What if programs were self-similar w.r.t. the 90-10 rule?

Remove the most frequent loop: does the 90-10 rule still hold?

Intuition might say yes: remove the loop, and we have another program.

[Figure: two plots of % remaining execution time (0 to 1) versus loop (1 to 10), and two plots of speedup (0 to 1000) versus loop (0 to 10)]

So we need only speed up the first few loops

After that, speedups are limited

Good from tool perspective!
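One way to see why later loops add little is Amdahl's law, which the slides do not name but which fits the argument. The sketch below assumes hypothetical per-loop time fractions that fall off quickly, and a loop speedup of 10 (the base-case ratio used later in this deck); `overall_speedup` is an illustrative name.

```python
def overall_speedup(fractions, loop_speedup):
    """Amdahl's law: the listed fractions of original time run loop_speedup times faster."""
    moved = sum(fractions)
    return 1.0 / ((1.0 - moved) + moved / loop_speedup)


# Hypothetical share of total execution time for the six most frequent loops.
loop_fractions = [0.50, 0.25, 0.10, 0.03, 0.01, 0.01]

for k in range(1, len(loop_fractions) + 1):
    speedup = overall_speedup(loop_fractions[:k], loop_speedup=10.0)
    print(f"first {k} loop(s) in hardware: overall speedup {speedup:.2f}")
```

Under these assumed fractions, moving the second loop adds far more speedup than moving the sixth, which is the pattern the slide summarizes.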


Manually Partitioned Several PowerStone Benchmarks onto Triscend A7 and E5 Chips

Used multimeter and timer to measure performance and power

Obtained good speedups and energy savings by partitioning software among microprocessor and on-chip FPGA

A7 results

Benchmark  Time_orig  Time_sw/hw  Sp.  P_orig  P_sw/hw  E_orig  E_sw/hw  E sav
PS_g3fax       11.47        7.44  1.5   1.320    1.332  15.140    9.910    35%
PS_crc         10.92        4.51  2.4   1.320    1.320  14.414    5.953    59%
PS_brev         9.84        3.28  3.0   1.332    1.344  13.107    4.408    66%
Average:                          2.3                                      53%

E5 results

Benchmark  Time_orig  Time_sw/hw  Sp.  P_orig  P_sw/hw  E_orig  E_sw/hw  E sav
PS_g3fax       15.16        7.11  2.1   0.252    0.270   3.820    1.920    50%
PS_crc         10.64        4.64  2.3   0.207    0.225   2.202    1.044    53%
PS_brev        17.81        1.81  9.8   0.252    0.270   4.488    0.489    89%
Average:                          4.8                                      64%

[Photos: E5 IC; Triscend A7 development board]
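The table columns are related by energy = power × time, so speedup and energy savings follow directly from the measured times and powers. The sketch below recomputes two rows from the tables above; the function name `summarize` is illustrative, and the slides do not state the units, so none are assumed here.

```python
def summarize(time_orig, time_part, power_orig, power_part):
    """Derive speedup, energies, and energy savings from measured time and power."""
    energy_orig = power_orig * time_orig
    energy_part = power_part * time_part
    return {
        "speedup": time_orig / time_part,
        "energy_orig": energy_orig,
        "energy_part": energy_part,
        "energy_savings": 1.0 - energy_part / energy_orig,
    }


# Measured values taken from the A7 and E5 tables.
a7_g3fax = summarize(time_orig=11.47, time_part=7.44, power_orig=1.320, power_part=1.332)
e5_brev = summarize(time_orig=17.81, time_part=1.81, power_orig=0.252, power_part=0.270)

for label, row in (("A7 PS_g3fax", a7_g3fax), ("E5 PS_brev", e5_brev)):
    print(f"{label}: speedup {row['speedup']:.1f}, energy savings {row['energy_savings']:.0%}")
```

Note that power rises slightly in the partitioned versions, yet energy still drops because execution time falls by much more.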


Simulation-Based Results for More Benchmarks

Example     Archit    Cycles_orig      Cycles_sw   Cycles_hw  Loop Sp.  Clk_hw  Total Sp.  P_sw   P_hw  E_orig  E_sw/hw  E Sav
PS_g3fax    8051       19,675,456     10,812,544     176,562        61      25        2.2  0.05  0.032  0.1142  0.05408    53%
PS_crc      8051          291,196        180,224       7,168        25      25        2.5  0.05  0.028  0.0017  0.00071    58%
PS_summin   8051      109,821,892     20,394,080     384,416        53      25        1.2  0.05  0.033  0.6376  0.53657    16%
PS_brev     8051          330,064        305,768       1,360       225      25       12.9  0.05  0.034  0.0019  0.00015    92%
PS_matmul   8051          119,420        101,576       2,560        40      25        5.9  0.05  0.035  0.0007  0.00012    82%
PS_g3fax    MIPS       15,600,000      4,720,000     599,000         8     100        1.4  0.07  0.111  0.0265  0.02163    18%
PS_adpcm    MIPS          113,000         29,300       5,440         5     100        1.3  0.07  0.181  0.0002  0.00018     6%
PS_crc      MIPS        5,040,000      3,480,000     460,800         8     100        2.5  0.07  0.061  0.0086  0.00379    56%
PS_des      MIPS          142,000         70,700      15,100         5     100        1.6  0.07  0.197  0.0002  0.00019    20%
PS_engine   MIPS          915,000        145,000      28,100         5     100        1.1  0.07  0.082  0.0016  0.00146     6%
PS_jpeg     MIPS        7,900,000        646,000     171,000         4     100        1.1  0.07  0.092  0.0134  0.01360    -1%
PS_summin   MIPS        2,920,000      1,270,000     266,000         5     100        1.5  0.07  0.111  0.0050  0.00375    24%
PS_v42      MIPS        3,850,000        846,000     216,000         4     100        1.2  0.07  0.102  0.0065  0.00605     7%
PS_brev     MIPS            3,566          2,499         138        18     100        3.0  0.07  0.107  0.0000  0.00000    62%
MB_g721     MIPS      838,230,002    457,674,179   9,985,261        46     100        2.1  0.07  0.152  1.4250  0.75035    47%
MB_adpcm    MIPS       32,894,094     32,866,110   1,183,260        28      42       11.6  0.07  0.130  0.0559  0.00821    85%
MB_pegwit   MIPS       42,752,919     33,276,287   2,167,651        15      50        3.1  0.07  0.170  0.0727  0.03241    55%
NB_dh       MIPS    1,793,032,157  1,349,063,192  45,156,767        30      69        3.5  0.07  0.121  3.0482  1.00547    67%
NB_md5      MIPS        5,374,034      3,046,881     289,877        11      47        1.8  0.07  0.251  0.0091  0.00722    21%
NB_tl       MIPS       57,412,470     29,244,221   2,479,552        12      58        1.8  0.07  0.059  0.0976  0.05930    39%
Average:                                                            30                3.2                                  34%

Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg)

(Quicker than physical implementation, results matched reasonably well)
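For the 8051 rows, the loop and total speedups in the table can be reproduced from the cycle counts alone. The sketch below rests on my reading of the column headers, which the slides do not spell out: Cycles_sw is taken as the software cycles spent in the partitioned loop, Cycles_hw as that loop's hardware cycles, and both are assumed to be counted at the same clock. Rows where the hardware clock differs from the processor clock would need a clock-ratio correction that is not modeled here.

```python
def loop_and_total_speedup(cycles_orig, cycles_sw_loop, cycles_hw_loop):
    """Return (loop speedup, total speedup), assuming cycles are directly comparable."""
    loop_speedup = cycles_sw_loop / cycles_hw_loop
    cycles_new = cycles_orig - cycles_sw_loop + cycles_hw_loop
    return loop_speedup, cycles_orig / cycles_new


# Cycle counts from the 8051 rows of the table.
g3fax = loop_and_total_speedup(19_675_456, 10_812_544, 176_562)
brev = loop_and_total_speedup(330_064, 305_768, 1_360)

for name, (loop_sp, total_sp) in (("PS_g3fax", g3fax), ("PS_brev", brev)):
    print(f"{name}: loop speedup {loop_sp:.0f}, total speedup {total_sp:.1f}")
```

PS_g3fax shows the gap between the two numbers clearly: a loop speedup of about 61 yields a total speedup of only about 2.2, because roughly 45% of the original cycles stay in software.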


Looking at Multiple Loops per Benchmark

Manually created several partitioned versions of each benchmark

Most speedup gained with first 20,000 gates

Surprisingly few gates!

Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM) 2002

Stitt and Vahid, IEEE Design and Test, Dec. 2002

J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear).

[Figure: speedup (1.0 to 5.0) versus gates (0 to 25,000) for G721(MB), ADPCM(MB), PEGWIT(MB), DH(NB), MD5(NB), TL(NB), and URL(NB); off-scale annotations: 27.2 and 2.05 at 90,000]


Ideal Speedups for Different Architectures

[Figure: two rows of five plots, with speedup (1 to 10) on the y-axis and 0 to 10 on the x-axis]

Varied loop speedup ratio (sw time / hw time of loop itself) to see impact of faster microprocessor or slower FPGA: 30, 20, 10 (base case), 5 and 2

Loop speedups of 5 or more work fine for first few loops, not hard to achieve


Ideal Energy Savings for Different Architectures

Varied loop power ratio (FPGA power / microprocessor power) to account for different architectures – 2.5, 2.0, 1.5 (base case), 1.0

Energy savings quite resilient to variations

[Figure: two rows of five plots, with energy savings (0 to 1) on the y-axis and 0 to 10 on the x-axis]
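The resilience claim can be illustrated with a simple model. The slides do not give the energy model used, so the following is an assumption: the microprocessor is taken to draw negligible power while the FPGA runs a loop, the loop fraction of 0.5 and loop speedup of 10 are hypothetical, and `ideal_energy_savings` is an illustrative name. Only the power ratios (2.5, 2.0, 1.5, 1.0) come from the slide.

```python
def ideal_energy_savings(loop_fraction, loop_speedup, power_ratio):
    """Fraction of energy saved when loops covering loop_fraction of time move to the FPGA.

    power_ratio is FPGA power / microprocessor power. Assumes the microprocessor
    consumes negligible power while the FPGA executes the loop.
    """
    energy_new = (1.0 - loop_fraction) + power_ratio * loop_fraction / loop_speedup
    return 1.0 - energy_new


for ratio in (2.5, 2.0, 1.5, 1.0):
    savings = ideal_energy_savings(loop_fraction=0.5, loop_speedup=10.0, power_ratio=ratio)
    print(f"power ratio {ratio}: energy savings {savings:.1%}")
```

Under this model the savings reduce to loop_fraction × (1 − power_ratio / loop_speedup), so as long as the loop speedup is well above the power ratio, changing the ratio moves the result by only a few percentage points.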


How is Automated Partitioning Done?

[Figure: design flow from informal spec through system partitioning, sw and hw specs, and sw and hw design to a processor + FPGA platform and an ASIC, with the later partitioning step marked]

Previous data obtained manually


Source-Level Partitioning

Front-end converts code into intermediate format, such as SUIF (Stanford University Intermediate Format)

Intermediate format explored for hardware candidates

Binary is generated from assembling and linking. Hw source is generated and synthesized into netlist

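[Figure: source-level flow. SW source enters the compiler front-end, followed by hw/sw partitioning. The software side continues through the compiler back-end (assembly & object files) and the assembler & linker to a binary for the processor; the hardware side produces hw source, which synthesis turns into netlists for the FPGA]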


Problems with Source-Level Partitioning

Though technically superior, source-level partitioning

Disrupts standard commercial tool flow significantly

Requires special compiler (ouch!)

Multiple source languages, changing source languages

How to deal with library code, assembly code, object code?

[Figure: C source, C++ source, and Java source feeding the compiler front-end; labels: C SUIF compiler, C++ SUIF compiler, ?]


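Binary Partitioning

[Figure: binary-level flow. SW source goes through compilation (assembly & object files) and the assembler & linker to a binary. Hw/sw partitioning then operates on the binary, producing hw source, which synthesis turns into netlists for the FPGA, and an updated binary for the processor]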

Source code is first compiled and linked in order to create a binary.

Candidate hardware regions (a few small, frequent loops) are decompiled for partitioning

HDL is generated and synthesized, and binary is updated to use hardware


Binary-Level Partitioning Results (ICCAD’02)

• Source-Level

• Average speedup, 1.5

• Average energy savings, 27%

• Average 4,361 gates

• Binary-Level

• Average speedup, 1.4

• Average energy savings, 13%

• Large area overhead, averaging 10,325 gates


Binary Partitioning Could Eventually Lead to Dynamic Hw/Sw Partitioning

Dynamic software optimization gaining interest

e.g., HP’s Dynamo

What better optimization than moving to FPGA?

Add component on-chip:

Detects most frequent sw loops

Decompiles a loop

Performs compiler optimizations

Synthesizes to a netlist

Places and routes the netlist onto (simple) FPGA

Updates sw to call FPGA

[Figure: chip with processor, I$, D$, memory, DMA, and configurable logic, plus an added profiler with its own processor and memory]

Self-improving IC

Can be invisible to designer

Appears as efficient processor

HARD! Much future work.
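The first step, detecting the most frequent software loops, could be sketched as counting taken backward branches in a branch trace. This is an assumed technique for illustration only; the slides do not say how the on-chip profiler would work. The trace format, the addresses, and the name `hottest_loops` are all hypothetical.

```python
from collections import Counter


def hottest_loops(branch_trace, top_n=2):
    """Count taken backward branches (target <= source) and return the most frequent.

    Each result is ((loop_start, loop_end), count), where loop_start is the
    branch target and loop_end is the address of the backward branch.
    """
    counts = Counter(
        (target, source) for source, target in branch_trace if target <= source
    )
    return counts.most_common(top_n)


# Hypothetical trace of taken branches as (source_address, target_address) pairs.
trace = (
    [(0x120, 0x100)] * 900    # tight inner loop
    + [(0x240, 0x200)] * 90   # less frequent loop
    + [(0x300, 0x340)] * 50   # forward branch, ignored
    + [(0x410, 0x400)] * 10   # rarely executed loop
)

loops = hottest_loops(trace)
for (start, end), count in loops:
    print(f"loop {start:#x}..{end:#x}: {count} iterations")
```

A hardware profiler would keep such counts in a small table rather than a full trace, but the selection criterion, a few small and frequent loops, is the same one the binary partitioning slide describes.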


Conclusions

Hardware/software partitioning can significantly improve software speed and energy

Single-chip microprocessor/FPGA platforms, increasing in popularity, make such partitioning even more attractive

Successful commercial tool still on the horizon

Binary-level partitioning may help in some cases

Source-level can yield massive parallelism (Profs. Najjar/Payne)

Future dynamic hw/sw partitioning possible?

Distinction between sw/hw continually being blurred!

Many people involved:

Greg Stitt, Roman Lysecky, Shawn Nematbakhsh, Dinesh Suresh, Walid Najjar, Jason Villarreal, Tom Payne, several others…

Support from NSF, Triscend, and soon SRC…

Exciting new directions!
