lecture 10b: implementing dsp functionality:...

1 Kurt Keutzer Lecture 10b: Implementing DSP Functionality: Alternatives Prepared by: Professor Kurt Keutzer Computer Science 252, Spring 2000 With contributions from: Prof. Heinrich Meyr, University of Aachen Philip Chong, David Chinnery, Rhett Davis, Paul Husted, Niraj Shah, Chris Taylor, Scott Weber, Ning Zhang

1Kurt Keutzer

Lecture 10b: Implementing DSP Functionality:

Alternatives

Prepared by: Professor Kurt Keutzer

Computer Science 252, Spring 2000

With contributions from:

Prof. Heinrich Meyr, University of Aachen

Philip Chong, David Chinnery, Rhett Davis, Paul Husted,

Niraj Shah, Chris Taylor, Scott Weber, Ning Zhang

2Kurt Keutzer

System Implementation Choices

[Figure: implementation choices for system functionality: off-the-shelf µP/DSP, embedded core µP/DSP, application-specific µP (ASIP), and ASIC. The DSP core and the ASIP core are each shown with program ROM, coefficient ROM, and control.]

3Kurt Keutzer

Making a Successful Comparison - 1

Find an interesting application kernel

• Viterbi decoding for speech processing (not a full modem!)

Find realistic constraints native to the application

• n=2, K=7, QPSK, 100 kb/s, BER = 10^-4

Find architectures/implementations that are promising for the application

• TI TMS320C54, Tensilica Xtensa

• What are the relevant features of this architecture that support this application?

Fix application constraints across all implementations (above)

Fix key parameters for implementation comparison

• performance (constraint)

• area

• power

4Kurt Keutzer

Making a Successful Comparison - 2

Identify how key parameters will be measured

• performance - instruction set simulator, eval board

• area - data sheets, gate estimates

• power - eval board, TI application note

Implement your application kernel

• Examine different algorithms

• Start with code downloaded from the web - multimedia benchmarks etc.

• Build your software development/evaluation environment:
  • http://www.ti.com/sc/docs/tools/dsp/6ccsfreetool.htm

5Kurt Keutzer

Making a Successful Comparison - 3

Implement your application kernel (cont)

• Phase 0: Research
  • Find application notes, research reports for your own or comparable architectures

• Phase 1: Estimation
  • Develop a quick estimate based on initial code
  • Integrate research findings
  • Do a quick back-of-envelope reality check

• Phase 2: Real implementation/Tuning
  • Tailor algorithm, implementation to architecture
  • Do your very best! Have a contest with your partner

• Phase 3: Evaluation
  • Apply evaluation tools to key parameters
  • Evaluate and compare results - return to 2

If your life depended on choosing the right part - what would you do?

6Kurt Keutzer

Making a Successful Comparison - 4

Final evaluation and comparison - compare all implementations

To evaluate for a product - everything is fair game

To evaluate principally the architectures - need to consider:

• fab differences - TSMC vs. IBM (10-20% faster)

• process differences - .35 micron vs. .25 (50% faster)

• power supply differences - 3.0V vs. 1.5V

• ASIC vs. custom implementations (2x faster)

Now evaluate - if I were the architect of this processor/implementor of this system on a chip, what would I do differently?

• cache sizes

• register availability

• additional instructions

• on-chip memory

7Kurt Keutzer

Making a Successful Comparison - 5

Just for fun …

In addition to the primary constraints (speed, cost, power), final real-world considerations:

• business relationships (joint partnership with Lucent)

• time-to-market issues
  • time to configure?
  • software development environment
  • library/application software support
  • application engineering support

8Kurt Keutzer

Viterbi Algorithm

Prof. Heinrich Meyr

University of Aachen

9Kurt Keutzer

Viterbi Decoders in digital communication systems

[Figure: transmission chain. Signal source → source coder → convolutional or trellis coder & mapper → modulator → channel → demodulator → Viterbi decoder → source decoder → signal sink. Signal labels: information bits, channel symbols c_k, received symbols y_k, decoded bits.]

10Kurt Keutzer

Convolutional Coder and Trellis diagram

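[Figure: three blocks, CONVOLUTIONAL CODER, CHANNEL, and VITERBI DECODER. The coder takes information bits u_k through two z^-1 delay elements (u_k-1, u_k-2) and modulo-2 additions to form code symbols b_k; a BPSK mapper turns them into channel symbols c_k. The channel adds white noise n_k to give y_k. The decoder produces decisions, stores them in the survivor memory, and outputs decoded bits. The accompanying trellis diagram spans times 0, k, k+1, ..., T-1, T over states 0 to 3 (s_0,k → s_0,k+1 through s_3,k → s_3,k+1), from a known start state x_0 = 0 to a known end state x_T = 0, with branches marked b_k^i = 0 and b_k^i = 1.]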

11Kurt Keutzer

ACS recursion for M = 2

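For each state i, the ACS unit extends the two candidate paths and keeps the one with the larger metric:

$$\gamma_k^{(0,i)} = \gamma_{Z(0,i),k-1} + \lambda_k^{(0,i)}$$

$$\gamma_k^{(1,i)} = \gamma_{Z(1,i),k-1} + \lambda_k^{(1,i)}$$

$$\gamma_{i,k} = \max\{\gamma_k^{(0,i)},\ \gamma_k^{(1,i)}\}$$

[Figure: survivor path and competing path into state i. In the case drawn, $d_{i,k} = 1$ and $\gamma_{i,k} = \gamma_k^{(1,i)}$.]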
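The recursion above can be sketched for a single state as follows. This is a minimal illustration, not code from the lecture: the function name `acs` and its argument layout are assumptions, and ties are resolved in favor of branch 0 by choice.

```python
def acs(gamma_prev, lam0, lam1, z0, z1):
    """One add-compare-select step for a single state i at time k.

    gamma_prev : state metrics at time k-1, indexed by state
    lam0, lam1 : branch metrics of the two incoming branches
    z0, z1     : predecessor states Z(0,i) and Z(1,i)
    Returns (gamma_i_k, d_i_k).
    """
    cand0 = gamma_prev[z0] + lam0  # add: path through Z(0,i)
    cand1 = gamma_prev[z1] + lam1  # add: path through Z(1,i)
    if cand1 > cand0:              # compare
        return cand1, 1            # select branch 1, decision bit 1
    return cand0, 0                # select branch 0, decision bit 0


if __name__ == "__main__":
    print(acs([3, 5], 2, 1, 0, 1))
```

The decision bit is the only thing the survivor memory needs from this step; the metric itself is fed back for time k+1.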

12Kurt Keutzer

Viterbi Decoder block diagram

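[Figure: Viterbi decoder block diagram. Channel symbols y_k enter the TMU, which produces branch metrics for the ACSU; the ACSU feeds state metrics back through a latch and passes decision bits to the SMU, which outputs the decoded bits u_k.]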

13Kurt Keutzer

Characteristic of a 2-bit step-at-zero quantizer

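[Figure: quantizer characteristic plotted against normalized input level (-2, -1, 1, 2). Output values Q = -2, -1, 0, 1, with saturation at both ends. Interpretation of the four output values: -2, -1, 1, 2.]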

14Kurt Keutzer

Architecture

15Kurt Keutzer

Node parallel ACS architecture

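[Figure: node-parallel ACS architecture. The TMU supplies branch metrics λ_k^(0,i) and λ_k^(1,i) to N ACS units (0, 1, ..., N-1). The state metrics γ_0,k, γ_1,k, ..., γ_N-1,k are held in a register and routed back through a shuffle-exchange network. The decisions dec(i,k) go to the SMU.]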

16Kurt Keutzer

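Alternative Implementations

[Figure: two alternatives side by side. In each, state-metric memories (M) feed ACS units; one shows a butterfly with dedicated ACS units, the other a butterfly with a shared ACS.]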

17Kurt Keutzer

Butterfly trellis structure and resource sharing for the K = 3, rate 1/2 code

[Figure: butterfly trellis for the four states. Old state metrics γ_0,k, γ_1,k, γ_2,k, γ_3,k are read from path metric memory through MUXes into shared ACS units, which produce the new state metrics γ_0,k+1, γ_1,k+1, γ_2,k+1, γ_3,k+1.]

18Kurt Keutzer

Survivor Memory Unit

19Kurt Keutzer

REA hardware architecture

[Figure: REA survivor memory for the four-state trellis. One row per state s_0,k to s_3,k; each row is a chain of processing elements (PE) of depth D. The decisions d_0,k to d_3,k select which predecessor's stored sequence each row takes over, so the rows hold the survivor sequences û^[0] to û^[3] from index k-1 down to k-D+1, and the oldest symbols û^[0]_k-D to û^[3]_k-D appear at the outputs.]

20Kurt Keutzer

[Figure: trellis split into two phases, acquisition of the final survivor and decoding. Decoded sequence: 0 0 ... 0 1 0. Time markers: û^[0]_k, û^[0]_k-D, û^[0]_k-(D+M-1).]

21Kurt Keutzer

Viterbi Project Constraints

• uncoded word length = 1

• coded word length (n) = 2
  • this means that it is rate 1/2

• constraint length (K aka. L) = 7
  • this means that the number of states in the trellis is 2^(K-1), or 64 states

• branch metric calculation is QPSK

• soft decision wordlength (q) = 6

• chain-backing depth (D) = 96

• generator polynomials:
  • p0 = 171,
  • p1 = 133 (octal)
  • this means that p0 = 1111001, p1 = 1011011

• data rate 100 kb/s

• goal: bit error rate (BER) = 10^-4

• signal to noise ratio (SNR)

• degradation 0.05 dB
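As a quick check of the constraints above, the sketch below converts the octal generator polynomials to binary, counts the trellis states, and runs a rate-1/2 encoder. The bit ordering (newest input bit in the least significant position of the shift register) and the names `encode` and `parity` are assumptions made for illustration; the slides do not fix a convention.

```python
K = 7        # constraint length
P0 = 0o171   # generator polynomial p0 (octal)
P1 = 0o133   # generator polynomial p1 (octal)


def parity(x):
    """Modulo-2 sum of the bits of x."""
    return bin(x).count("1") & 1


def encode(bits):
    """Rate-1/2 convolutional encoder, shift register starting at zero.

    Assumed convention: the newest bit enters at the LSB of a K-bit register.
    Returns one (p0 output, p1 output) pair per input bit.
    """
    sr = 0
    out = []
    for b in bits:
        sr = ((sr << 1) | (b & 1)) & ((1 << K) - 1)
        out.append((parity(sr & P0), parity(sr & P1)))
    return out


if __name__ == "__main__":
    print(format(P0, "07b"), format(P1, "07b"))  # binary forms of p0, p1
    print(1 << (K - 1))                          # number of trellis states
    print(encode([1, 0, 1, 1]))
```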

22Kurt Keutzer

Viterbi Decoder Implementation on an ARM

EE 290S Final Project

May 4, 1999

Philip Chong

23Kurt Keutzer

ARM Overview

32-bit RISC microprocessor

Five stage pipeline

Features fast ALU operations (barrel shifter)

Scalar integer unit, no FPU

24Kurt Keutzer

Algorithm Tweaking

Performing the metric computation through table lookup (load = 1 delay slot) is faster than using the ALU (multiplication = up to 3 delay slots)

Parity computation (Viterbi code) can also be done through table lookup
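The parity lookup mentioned above can be sketched as below. The table layout is hypothetical: the slides give only a total of 768 bytes for the parity tables, not their organization, so a single 256-entry byte-parity table is assumed here.

```python
# Precomputed parity (modulo-2 bit sum) of every 8-bit value.
PARITY8 = [bin(i).count("1") & 1 for i in range(256)]


def parity_lookup(x):
    """Parity of a value below 2**16 using two table lookups and one XOR."""
    return PARITY8[x & 0xFF] ^ PARITY8[(x >> 8) & 0xFF]


if __name__ == "__main__":
    # Parity of the register contents masked by a generator polynomial
    print(parity_lookup(0b1010101 & 0o171))
```

The table replaces a loop over bits with memory reads, which is the trade the slide describes: a load costs fewer delay slots than the arithmetic it avoids.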

25Kurt Keutzer

Reducing Memory Footprint

Cache misses can be very costly due to pipeline stalls

We are willing to give up some algorithmic efficiency to eliminate cache misses

To minimize the memory footprint, we pack 32 bits of traceback into a single word; the barrel shifter makes unpacking this data cheap (a 1-cycle operation)

For a 128-level traceback, memory requirements are 512 bytes (metrics table) + 1024 bytes (traceback) + 768 bytes (parity lookup tables) = 2304 bytes
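The packing scheme and the footprint arithmetic can be sketched as follows. The helper names `pack_decisions` and `get_decision` are illustrative, and the bit order within a word is an assumption; only the sizes come from the slide.

```python
STATES = 64      # 2^(K-1) trellis states for K = 7
DEPTH = 128      # traceback levels
WORD_BITS = 32   # decision bits packed per word


def pack_decisions(bits):
    """Pack one 0/1 decision per state into 32-bit words."""
    words = [0] * ((len(bits) + WORD_BITS - 1) // WORD_BITS)
    for i, b in enumerate(bits):
        words[i // WORD_BITS] |= (b & 1) << (i % WORD_BITS)
    return words


def get_decision(words, state):
    """Unpack a single decision bit: one shift and one mask."""
    return (words[state // WORD_BITS] >> (state % WORD_BITS)) & 1


# One decision bit per state per traceback level
traceback_bytes = STATES * DEPTH // 8
# Metrics table and parity tables sizes as quoted on the slide
total_bytes = 512 + traceback_bytes + 768

if __name__ == "__main__":
    print(traceback_bytes, total_bytes)
```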

26Kurt Keutzer

Simulation Results

Simulated decoding of 4096 bits on a 125 MHz 3.3V model

Execution requires 11.72M ARM instruction cycles, giving 44 kb/s data rate

Power consumption was estimated at 52.47 mW

Scaling simulation results up to 275 MHz 2.0V ARM (fastest commercially available) gives 96 kb/s at 42.40 mW
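The figures above can be reproduced with a short calculation. The scaling model used here (throughput proportional to clock frequency, power proportional to f·V²) is an assumption; it is the standard dynamic-power approximation and happens to match the numbers on the slide.

```python
bits = 4096
cycles = 11.72e6
f_sim, v_sim, p_sim = 125e6, 3.3, 52.47  # simulated model: Hz, V, mW
f_tgt, v_tgt = 275e6, 2.0                # target part: Hz, V

# Throughput of the simulated model in bits per second
rate_sim = bits * f_sim / cycles
# Assumed scaling: same cycle count per bit at the higher clock
rate_tgt = rate_sim * f_tgt / f_sim
# Assumed scaling: dynamic power ~ f * V^2
p_tgt = p_sim * (f_tgt / f_sim) * (v_tgt / v_sim) ** 2

if __name__ == "__main__":
    print(round(rate_sim / 1e3), "kb/s at 125 MHz")
    print(round(rate_tgt / 1e3), "kb/s at 275 MHz")
    print(round(p_tgt, 2), "mW at 275 MHz, 2.0 V")
```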

27Kurt Keutzer

Summary

Clock speed: 275 MHz

Execution Performance: 96kb/s

Power Dissipation: 42.40 mW (5.68 mW/mm^2)

Area: 7.47 mm^2 in 0.25 µm

Design Effort: 4 days

Portability very high: code is ANSI C; architecture-dependent tweaks may need reworking

28Kurt Keutzer

Conclusion/Thanks

One-bit quantization gives opportunities for performance improvements, at a huge cost in QOR

Viterbi algorithm would benefit greatly from having hardware parallelism (vector ops) available

Many thanks to Marlene Wan for providing power estimation

29Kurt Keutzer

Viterbi Decoder Implementation on a TI C54x

EE 290S Final Project

May 4, 1999

Paul Husted

30Kurt Keutzer

Introduction

Implemented Viterbi Decoder on a TI TMS320VC5402 DSP

Examine:

• Performance (bits/sec)

• Power (mW/bit)

• Cost ($/unit, area)

• Design effort (engineer-months)

31Kurt Keutzer

Viterbi Decoder Specifications

Implementation Specifications:

• Constraint Length (K aka. L) = 7

• Branch Metric Calculation is QPSK

• Soft Decision Wordlength (q) = 6

• Chain-backing Depth (D) = 96

• Gen. Polynomials: p0 = 171, p1 = 133 (octal)

• Data Rate 100 kb/s

• Goal: Bit Error Rate (BER) = 10^-4

32Kurt Keutzer

C54x Capabilities

Capabilities of all C54x DSP Cores:

• Three 16-bit Data Buses, One 16-bit Program Bus

• 40-bit ACC with 40-bit barrel shifter

• Two independent accumulators

• A single-cycle non-pipelined MAC

• Single-instruction repeat and block-repeat

• Six-channel DMA controller

• Arithmetic instructions with parallel store and parallel load

33Kurt Keutzer

Helpful Instructions for the Viterbi Decoder

The C54x Has a Specialized Instruction Set

• Dual Add/Subtract in 1 Cycle

• Compare, Select, and Store Unit (CSSU)
  • Compare Branch Metrics
  • Store Larger Value, Store Decision Bit
  • Increment Address Registers in Circular Buffer
  • 1 Cycle

• Allows Butterfly (2 States) in 5 cycles

34Kurt Keutzer

Butterfly Implementation

[Figure: butterfly. Old(2*j) and Old(2*j+1) feed New(j) and New(j+2^(K-2)) through the DADST/CMPS and DSADT/CMPS instruction pairs.]

T Register = Local Distance
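The butterfly can be sketched in a few lines. This is a plain-Python model of the data flow, not C54x code: the sign pattern (which path gets +T and which gets -T) and the tie-breaking rule are assumptions, and the function name `butterfly` is illustrative.

```python
K = 7
HALF = 1 << (K - 2)  # 2^(K-2) = 32: offset between the two new states


def butterfly(old, new, dec, j, t):
    """Update New(j) and New(j + 2^(K-2)) from Old(2j) and Old(2j+1).

    t is the local distance (the value held in the T register).
    Assumed sign pattern: the two branches into each new state carry +t and -t.
    """
    a, b = old[2 * j], old[2 * j + 1]

    # First new state: dual add/subtract, then compare-select
    c0, c1 = a + t, b - t
    new[j], dec[j] = (c0, 0) if c0 >= c1 else (c1, 1)

    # Second new state: dual subtract/add, then compare-select
    c0, c1 = a - t, b + t
    new[j + HALF], dec[j + HALF] = (c0, 0) if c0 >= c1 else (c1, 1)


if __name__ == "__main__":
    old = list(range(64))
    new, dec = [0] * 64, [0] * 64
    for j in range(HALF):
        butterfly(old, new, dec, j, t=3)
    print(new[0], dec[0], new[HALF], dec[HALF])
```

Each call touches two old metrics and produces two new ones, which is why 32 butterflies cover all 64 states of the K = 7 trellis.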

35Kurt Keutzer

TI TMS320VC5402 DSP

Specific Chip Characteristics:

• Operates at 100 MIPS

• Core Voltage of 1.8V

• I/O Pins Operate at 3.3V

• 16K Word x 16 Bits of Dual-Access RAM

• 4K Word x 16 Bits of ROM

• Internal DMA

• Created in 0.18 Micron Technology

36Kurt Keutzer

Dataflow

Data I/O

• Input Values Assumed to be Placed at Specified Memory Location by Internal DMA

• Output Values Assumed to be Removed from Another Memory Location by Internal DMA

• Alternatively, Data Could be Placed in this Memory Location After Other On-Chip Receiver Processing

37Kurt Keutzer

Implementation Analysis

Viterbi Decoder Code Created in Assembly

Linked to Processor Specific Memory Map

Simulated on Cycle-Accurate Simulator

• Used Correct Memory Model for VC5402

38Kurt Keutzer

Implementation Results

                        Estimated             Actual
Code Size               500 instructions      1032 (16-bit) words
Data Size               1280 (16-bit) words   1280 (16-bit) words
MIPS (at 100 Kbps)      18.425                21.53125
Max. Speed (100 MIPS)   582 Kbps              464.7 Kbps

39Kurt Keutzer

Power Calculation

Compared with TI Figures:

• TI uses 1/2 MACs, 1/2 NOPs For Power Figure

• .25 Micron Estimate is .45 mA/MIPS

• Fully Static Design can be Clocked at Any Rate

• Viterbi Code Uses 1.08 Times More Current than TI Estimate

At 22 MIPS, 19.25 mW are Consumed in the Core
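The core power figure follows from the numbers on this slide and the 1.8 V core voltage quoted earlier. The sketch below assumes the current scales linearly with MIPS, which is what a fully static design clocked at the required rate would suggest; variable names are illustrative.

```python
mips = 22              # instruction rate needed for 100 kb/s (rounded up)
ma_per_mips = 0.45     # TI estimate for the core, mA per MIPS
viterbi_factor = 1.08  # Viterbi code draws 1.08x the TI estimate
v_core = 1.8           # core supply voltage, V

current_ma = mips * ma_per_mips * viterbi_factor  # core current in mA
power_mw = current_ma * v_core                    # mA * V = mW

if __name__ == "__main__":
    print(round(current_ma, 3), "mA")
    print(round(power_mw, 2), "mW")
```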

40Kurt Keutzer

Area Estimate

TI Will Not Release Die Sizes

• .25 Micron Chips Fit Inside 3.2 mm x 3.2 mm Area on a 144-pin BGA

• Maximum Die Size is thus 10.24 mm^2

41Kurt Keutzer

Development Cost

Engineering Time

• Estimate - 3 days

• Assumes Engineer Has Experience with Assembly Language and TI Tools

Tool Cost - $13262.45

• Includes Emulator, Simulator, Compiler, Assembler, Linker, Debugger

Cost of Chip - $8.52

42Kurt Keutzer

Conclusion

Optimized Instructions Make Algorithm Efficient

Static Design Allows Clock Rate to be Set As Needed to Reduce Power

Flexibility Exists to Perform Other Processing of Data

Very Little Development Time/Cost

43Kurt Keutzer

ACS TIE Extension with State (ACS)

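[Figure: datapath of the ACS TIE instruction with state. Four 8-bit branch metrics bm3..bm0 are held in a 32-bit register (bits 31:24, 23:16, 15:8, 7:0). Rs and Rt each carry two packed path metrics (pm); adders, subtractors, and MSB-based selects perform the add-compare-select, and Rr receives the selected path metrics together with the decision bits. The control instruction selects the operation.]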

44Kurt Keutzer

Tensilica Viterbi Implementation

Niraj Shah

Scott Weber

290A Final Presentation

45Kurt Keutzer

Tensilica Flow

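[Figure: Tensilica tool flow. Elements shown: uArch Designer and TIE descriptions as inputs to the Tensilica Processor Generator, which generates (gen) the tools; .c sources compiled by xt-gcc into .o files, which are executed with xt-run.]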

46Kurt Keutzer

Xtensa Architecture

[Figure: Xtensa core with an attached TIE unit, connected through Rs, Rt, Rr, and I.]

TIE Extensions:

• single cycle

• state free

• no new exceptions

• no stalls

• typeless data

Rs, Rt, Rr are 32-bit registers

I is the instruction controlling the TIE unit

The Xtensa core is a 32-bit configurable RISC processor

47Kurt Keutzer

Viterbi Architecture

[Figure: Viterbi system blocks: ADC, I/O device, RAM, Init, ACS, TraceBack. The ACS and traceback portion is marked "Measured Performance Here".]

48Kurt Keutzer

TIE SetupBMreg (ACS)

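[Figure: datapath of the TIE SetupBMreg instruction. The low bytes (bits 7:0) of Rs and Rt carry the I and Q samples; adders and subtractors, with the constant 0x7F, form four 8-bit branch metrics bm3..bm0, which are packed into a 32-bit register (bits 31:24, 23:16, 15:8, 7:0). The control instruction selects the operation.]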

49Kurt Keutzer

ACS TIE Extension (ACS)

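[Figure: datapath of the ACS TIE instruction. Branch metrics bm3..bm0 come from the packed 32-bit register (bits 31:24, 23:16, 15:8, 7:0); Rs and Rt supply the path metrics (pm). Adders, a subtractor, and an MSB-based select perform the add-compare-select; Rr receives the new path metric and the decision bit, with the remaining bits set to 0. Instruction variants: ACS03 || ACS12 || ACS30 || ACS21.]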

50Kurt Keutzer

ACS TIE Extension with State (ACS)

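[Figure: datapath of the ACS TIE instruction with state. Four 8-bit branch metrics bm3..bm0 are held in a 32-bit register (bits 31:24, 23:16, 15:8, 7:0). Rs and Rt each carry two packed path metrics (pm); adders, subtractors, and MSB-based selects perform the add-compare-select, and Rr receives the selected path metrics together with the decision bits. The control instruction selects the operation.]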

51Kurt Keutzer

TIE Zmask (TraceBack)

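[Figure: datapath of the TIE Zmask instruction used in traceback. Operands Rs and Rt are combined with a left shift by 1, AND masks 0x7F and 0x3F, and an OR; the result is written to Rr. The control instruction selects the operation.]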

52Kurt Keutzer

Designs

All designs had a BER of 0.000095 after 10 million iterations

Design 1

• 100 MHz, 48 mW, 1K DCache, 1K ICache, TIE

Design 1+

• 222 MHz, 144 mW, 1K DCache, 1K ICache, TIE

Design 2-

• 100 MHz, 69 mW, 16K DCache, 16K ICache, TIE

Design 2

• 222 MHz, 191 mW, 16K DCache, 16K ICache, TIE

Design 3

• 222 MHz, 191 mW, 16K DCache, 16K ICache, TIE with state

53Kurt Keutzer

Performance

[Bar chart: throughput in Kb/s for each design, with the configured cache and with a perfect cache.]

Design      Cache   Perfect Cache
Design 1    118     409
Design 1+   263     909
Design 2-   357     409
Design 2    793     909
Design 3    966     1142

54Kurt Keutzer

Energy Dissipation

[Bar chart: energy in uJ/bit for each design, with the configured cache and with a perfect cache.]

Design      Cache   Perfect Cache
Design 1    0.4     0.12
Design 1+   0.54    0.16
Design 2-   0.19    0.17
Design 2    0.24    0.21
Design 3    0.2     0.17
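The energy-per-bit values in this chart are consistent with dividing each design's power by its throughput. The sketch below reproduces that relationship using the power figures from the Designs slide and the throughput figures from the Performance chart; the dictionary layout and function name are illustrative assumptions.

```python
# (power in mW, throughput with cache in kb/s, throughput with perfect cache in kb/s)
designs = {
    "Design 1":  (48, 118, 409),
    "Design 1+": (144, 263, 909),
    "Design 2-": (69, 357, 409),
    "Design 2":  (191, 793, 909),
    "Design 3":  (191, 966, 1142),
}


def energy_uj_per_bit(power_mw, rate_kbps):
    """mW divided by kb/s gives microjoules per bit."""
    return power_mw / rate_kbps


if __name__ == "__main__":
    for name, (p, r_cache, r_perfect) in designs.items():
        print(name,
              round(energy_uj_per_bit(p, r_cache), 2),
              round(energy_uj_per_bit(p, r_perfect), 2))
```

Design 1+ illustrates the point made in the conclusions: raising the clock without enlarging the cache raises power faster than throughput, so energy per bit gets worse.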

55Kurt Keutzer

[Bar chart: energy-delay in n(s*J)/bit for each design, with the configured cache and with a perfect cache.]

Design      Cache   Perfect Cache
Design 1    3.39    0.293
Design 1+   2.05    0.176
Design 2-   0.532   0.416
Design 2    0.315   0.231
Design 3    0.207   0.148

56Kurt Keutzer

Die Area

[Bar chart: die area in mm^2 for each design, with the configured cache and with a perfect cache.]

Design      Cache   Perfect Cache
Design 1    2.1     2.1
Design 1+   2.37    2.37
Design 2-   6.14    6.14
Design 2    6.7     6.7
Design 3    6.7     6.7

57Kurt Keutzer

Conclusions

TIE extensions, cache configuration, and improved code efficiency resulted in an order of magnitude improvement from our original

For power and performance, the effect of cache size is greater than the effect of a higher clock frequency

Use voltage scaling to reduce the power

If streaming data, then scale frequency

Adding state will result in the ability to increase performance

Having the ability to remove core instructions will decrease decode complexity and should lower power and area

58Kurt Keutzer

Soft Core Viterbi Decoder

EECS 290A Project

Dave Chinnery, Rhett Davis, Chris Taylor, Ning Zhang

59Kurt Keutzer

High Level Architecture

[Figure: high-level architecture block diagram. The block names did not survive extraction; the per-block shares, in the order they appear, are:]

% Gates   % Area   % Power
23%       36%      30%
0%        48%      15%
38%       8%       22%
18%       4%       16%
9%        2%       8%
4%        1%       5%
2%        1%       4%

60Kurt Keutzer

Branch & Path Metric Generation


Branch Metrics Computation apparently implemented with a CORDIC block (contains 840 MUX’s, 58 adders & flip-flops, 32 15-bit busses)

Branch Metrics Hard-wired to each ACS unit

Path Metrics Stored in ACS units

Each ACS unit handles 16 states

Hard-wired Path Metric Interconnect

61Kurt Keutzer

ACS Architecture

Each ACS unit stores 32 path metrics

Only two SRAM’s are active at a time

Across all four ACS units, each path metric is stored twice

SRAM accounts for 88% of the area and 27% of the power for each ACS unit

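[Figure: ACS unit. 8x9 SRAMs hold the upper and lower path metrics (PMU, PML); together with the upper and lower branch metrics (BMU, BML) they feed the add, compare, and select stages, followed by a MUX and a pipeline register.]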

62Kurt Keutzer

Traceback Architecture

State-machine blocks are just large sum-of-products combinational networks (351 gates each)

Each memory unit contains a 16x64 SRAM and logic (192 MUX’s, 128 flip-flops)

[Figure: traceback unit. Decision bits enter the traceback memory unit (MUX, SRAM, pipeline register, Next_ramin, 192-bit path); the finite state machine performs the traceback and produces the output decision bits.]

Traceback Memory Unit: 22% area, 20% power

Finite State Machine: 11% area, 13% power

63Kurt Keutzer

Design Flow

Synthesis & Module Generation

• Design Compiler synthesis script (from Mentor/Inventra)

• SRAM generator (from Norman Walker)

Pre-Layout Verification & Analysis

• VHDL gate-level sims (timing verification, switching activity annotation)

• PowerMill simulations (SRAM, core)

• Design Compiler, Power Compiler (static timing, power analysis)

Floor Planning, Place & Route

• Floor planning (Preview)

• Place & route (Silicon Ensemble)

• Interconnect parasitic extraction (“report simcap”)

Post-Layout Verification & Analysis

• PowerMill simulations, PathMill static analysis

• Design Compiler, Power Compiler (static timing, power analysis with back-annotated interconnect parasitics)

64Kurt Keutzer

Synthesis and SRAM Generation

Synthesis with Synopsys Design Compiler

• Constraint: 66 kHz clock (effectively infinite)

• Bottom-up synthesis of 62 VHDL entities

Low-Power SRAM generator (from Pleiades)

• Very large sense-amps, control logic

• Optimized for power, speed at low supply voltages

• Word length limited to a power of 2

65Kurt Keutzer

Simulation Models

Behavioral C

• Parameterized, bit-true, and fast

• Used for system level design and BER simulations

Behavioral VHDL

• Parameterized, bit-true, and cycle-true

• Used for structural simulations and test bench reference

RTL VHDL

• Synthesizable, crafted for specific parameters and implementation structure

• Used for synthesis quality

66Kurt Keutzer

BER Simulation Results

67Kurt Keutzer

SRAM

Simulation Tools: TimeMill & PowerMill

Parameters

• 66 MHz clock

• Voltage 2.5V

• Randomly Generated Test Vectors

Results

• Power Analysis

• Timing Analysis

68Kurt Keutzer

SRAM: Power Numbers

SRAM used for ACS Unit

• 8 words by 9 data bits

Operations       Avg. (µA)   Avg. (mW)   Avg. (pJ)
Read Activity    663.73      1.659       24.885
Write Activity   563.21      1.408       21.120
Read/Write       612.29      1.530       22.950

With parasitic extraction:

Operations       Avg. (µA)   Avg. (mW)   Avg. (pJ)
Read Activity    949.89      2.3747      35.6205
Write Activity   772.830     1.9320      28.980
Read/Write       851.42      2.1285      31.9275
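The three columns of these tables are related by simple arithmetic, sketched below. The 2.5 V supply comes from the SRAM simulation parameters; the 15 ns cycle time is an assumption inferred from the ratio of the pJ and mW columns (it is close to the period of the 66 MHz clock), and the function names are illustrative.

```python
V_DD = 2.5         # supply voltage in volts (from the simulation parameters)
T_CYCLE_NS = 15.0  # assumed cycle time in ns, inferred from the table ratios


def power_mw(current_ua):
    """Average power: microamps times volts gives microwatts; convert to mW."""
    return current_ua * V_DD / 1000.0


def energy_pj(current_ua):
    """Energy per access: mW times ns gives pJ."""
    return power_mw(current_ua) * T_CYCLE_NS


if __name__ == "__main__":
    for label, i_ua in [("read", 663.73), ("write", 563.21), ("read/write", 612.29)]:
        print(label, round(power_mw(i_ua), 3), "mW", round(energy_pj(i_ua), 2), "pJ")
```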

69Kurt Keutzer

SRAM: Power Numbers

SRAM used for Traceback Unit

• 16 words by 64 data bits

Operations       Avg. (µA)   Avg. (mW)   Avg. (pJ)
Read Activity    2170.7      5.4267      81.4005
Write Activity   1893.4      4.7335      71.0025
Read/Write       2086.9      5.2172      78.2580

Parasitic Extraction?

70Kurt Keutzer

SRAM: Timing Numbers

Delays

• Setup Time; Hold Time

• Data Resolution: time needed for data address to become stable

                 Setup (ns)   Hold (ns)   Data Resolution (ns)
ACS SRAM         ~1           ~2          ~1.8
Traceback SRAM   ~1           ~2          ~5

71Kurt Keutzer

Place and Route

Floor planning of the Viterbi SRAM macro cells and standard cells was done in Preview, and Silicon Ensemble was used for routing.

Total SRAM macro cell area was 1.58 mm^2 (1.08 mm^2 with 9x8 SRAMs)

• Area of the 16 9x8 bit SRAM macro cells: 0.052 mm^2 each, 62% larger than required, as 16x8 bit SRAMs were used (SRAM generator output had been verified for powers of 2)

• Area of the 3 16x64 bit SRAM macro cells: 0.25 mm^2 each

Area of the standard cells: 1.02 mm^2 (0.35 mm^2 from DEF file)

Final chip area was 4.0 mm^2 (original estimate 2.5 mm^2)

Parasitics for timing simulation were extracted from the final routed nets in Silicon Ensemble.

72Kurt Keutzer

Wiring Statistics

Six metal layers, layers 5 and 6 used for power and ground respectively

Ground and power spaced alternately 100 um apart horizontally and vertically.

There were about 6200 nets and 46,114 vias.

Total wire lengths:

metal layer 1: 3,293 um

metal layer 2: 458,440 um

metal layer 3: 510,517 um

metal layer 4: 218,023 um

metal layer 5: 96,882 um signal, and 38,400 um power

metal layer 6: 8,660 um signal, and 37,500 um ground

wire length: 685 mm horizontal, 611 mm vertical, total 1296 mm

73Kurt Keutzer

Final Placement and Routing

Significant routing congestion at 16 by 64 bit SRAM outputs, due to Silicon Ensemble grid size of 1 um (observe white and light blue wires).

Minimum of 6 unroutable nets observed, even at 12 mm2 chip area.

Final size was 1.25 mm x 3.2 mm, 4 mm2, with 9 unroutable nets.

Violation reports in Silicon Ensemble did not identify which nets were unroutable, other than problems with ground and power connections.

74Kurt Keutzer

Static Timing Checks

                    Delay Before      Delay After       Max Clock         Max Symbol
                    Annotation (ns)   Annotation (ns)   Frequency (MHz)   Rate (Msps)
Critical Path       8.7               17                60                3.8
Longest SRAM Path   8.5               14                -                 -

All timing checks performed with Design Compiler’s report_timing command

Parasitic capacitances back-annotated with the set_load command

No RC parasitics annotated

No SRAM model was used for timing checks

Critical path was from ACS control logic, through a PM output MUX select signal (in an ACS unit), through the following ACS unit.

Checks performed at 2.5V

75Kurt Keutzer

Static Power Checks

Power                 Before Annotation   After SAIF Annotation   After Parasitic Annotation
Cell Internal (mW)    28                  20                      20
Net Switching (mW)    15                  6.3                     8.7
Total Dynamic (mW)    43                  26                      29
Cell Leakage (nW)     750                 810                     810

All power checks performed with Design Compiler’s report_power command

Switching activity was measured for every output port (transition counts over 16,000-cycle simulation)

Back-annotation performed with SAIF files

No SRAM model was used for power checks (added in manually)

Checks performed at 2.5V w/ 60 MHz clock

76Kurt Keutzer

Delay and Energy Scaling

77Kurt Keutzer

Performance Results

For fixed throughput requirement 100ksps:

                            Supply Voltage (V)   Clock Rate (MHz)   Symbol Rate (Msps)   Power (mW)
Optimized for Performance   2.5                  1.6                0.1                  1.59
Optimized for Energy        0.8                  1.6                0.1                  0.16
Optimized for EDP           1.25                 1.6                0.1                  0.40

                            Supply Voltage (V)   Clock Rate (MHz)   Symbol Rate (Msps)   Energy Delay Product (fJs)   Power (mW)
Optimized for Performance   2.5                  60                 3.75                 4.24                         59.6
Optimized for Energy        0.8                  7.46               0.47                 3.49                         0.76
Optimized for EDP           1.25                 25.12              1.57                 2.53                         6.24
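The energy-delay product column can be recomputed from the power and symbol-rate columns. The sketch below takes energy per symbol as power divided by symbol rate and delay as the symbol period, which is an assumption about how the slide's values were derived; it reproduces them to within rounding. It also shows that the 100 ksps power figures match a linear scale-down of the full-rate power at the same supply voltage.

```python
def edp_fjs(power_mw, rate_msps):
    """Energy-delay product in fJ*s per symbol."""
    energy_nj = power_mw / rate_msps  # mW / Msps = nJ per symbol
    delay_us = 1.0 / rate_msps        # symbol period in microseconds
    return energy_nj * delay_us       # nJ * us = fJ*s


def power_at_rate(power_mw, rate_msps, target_msps=0.1):
    """Assumed linear scaling of power with symbol rate at a fixed supply voltage."""
    return power_mw * target_msps / rate_msps


if __name__ == "__main__":
    for v, rate, p in [(2.5, 3.75, 59.6), (0.8, 0.47, 0.76), (1.25, 1.57, 6.24)]:
        print(v, "V:", round(edp_fjs(p, rate), 2), "fJs,",
              round(power_at_rate(p, rate), 2), "mW at 100 ksps")
```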

78Kurt Keutzer

Summary NORMALIZED (100kbs)

Implementation   Performance (kbs)   Power (mW)   Norm    Gates   Area (mm^2)   Gates/Area   Power (uW)/Gate   Effort (days)
CP 1             100.00              40.68        294.4   50000   2.10          23809.52     0.814             6
ARM              100.00              36.86        266.8   50000   7.47          6695.68      0.737             4
CP 2             100.00              2.47         17.9    47098   6.69          7040.06      0.052             6
CP 3             100.00              2.02         14.7    26480   6.69          3958.15      0.076             6
DSP              100.00              1.97         14.3    47098   10.24         4599.41      0.042             3
ASIC             100.00              0.14         1.0     35100   4.00          8775.00      0.004             30

79Kurt Keutzer

Summary MAX PERFORMANCE

Implementation   Performance (kbs)   Power (mW)   Norm   Gates   Area (mm^2)   Gates/Area   Power (uW)/Gate   Effort (days)
Reference        100.00              N/A          N/A    N/A     N/A           N/A          N/A               N/A
ARM              116.48              42.94        0.8    50000   7.47          6695.68      0.86              4
CP 1             118.00              48.00        0.9    50000   2.10          23809.52     0.96              6
DSP              464.70              89.46        1.8    47098   10.24         4599.41      1.90              3
CP 2             793.00              191.00       3.8    47098   6.69          7040.06      4.06              6
CP 3             966.00              191.00       3.8    26480   6.69          3958.15      7.21              6
ASIC             3750.00             50.60        1.0    35100   4.00          8775.00      1.44              30
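The derived columns of the maximum-performance summary (Gates/Area, Power per gate, and Norm) can be recomputed from the gate count, area, and power. In the sketch below, Norm is taken to be power relative to the ASIC, which is an assumption that matches the listed values; the row data are copied from the summary and the function name is illustrative.

```python
# name: (gates, area in mm^2, power in mW) at maximum performance
rows = {
    "CP 1": (50000, 2.10, 48.00),
    "DSP":  (47098, 10.24, 89.46),
    "CP 2": (47098, 6.69, 191.00),
    "CP 3": (26480, 6.69, 191.00),
    "ASIC": (35100, 4.00, 50.60),
}


def derived(gates, area_mm2, power_mw, ref_power_mw=50.60):
    """Return (gates per mm^2, uW per gate, power normalized to the ASIC)."""
    return (gates / area_mm2,
            power_mw * 1000.0 / gates,
            power_mw / ref_power_mw)


if __name__ == "__main__":
    for name, (g, a, p) in rows.items():
        density, uw_per_gate, norm = derived(g, a, p)
        print(name, round(density, 2), round(uw_per_gate, 2), round(norm, 1))
```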