an asynchronous array of simple processors for dsp applications zhiyi yu, michael meeuwsen, ryan...

An Asynchronous Array of Simple Processors for DSP Applications

Zhiyi Yu, Michael Meeuwsen, Ryan Apperson, Omar Sattari, Michael Lai, Jeremy Webb, Eric Work,

Tinoosh Mohsenin, Mandeep Singh, Bevan Baas

VLSI Computation Lab, ECE DepartmentUniversity of California, Davis, USA

Outline

• Project objectives• Key features of the AsAP processor• Design of the AsAP processor• Results

Target Applications

• Computationally intensive DSP and scientific apps– Key components in many systems– Require high performance– Limited power budgets– Require innovations in architecture and circuit design

Project Objectives

• Programming flexibility• High performance

– Throughput– Latency

• High energy efficiency• Suitable for future

fabrication technologiesP

erfo

rman

ce &

Ene

rgy

effic

ienc

yProgramming flexibility

ASIC

FPGA

Prog.DSP

AsAP

Outline


Key Features of the AsAP Processor

Chip multiprocessor High performance

Small memory& simple processor

High energy efficiencyGlobally asynchronous

locally synchronous (GALS)Technology scalability

Nearest neighbor communication

High Performance Through Chip Multiprocessor

• Increasing the clock frequency is challenging

• Parallelism is more promising– Instruction level parallelism (VLIW, Superscalar)– Data level parallelism (SIMD)– Task level parallelism

Higher clock frequency

Increased design complexity & lower energy efficiency

Deeper pipeline

Task Level Parallelism• Well suited for DSP applications

scramcoding

inter-leave

mod.map

loadfftinter-leave

IFFTwin-dow

up-sampfilter

scaleclip

up-sampfilter

in out

trainingpilots

802.11a/g wireless LAN (54 Mbps, 5 GHz) baseband transmit path

Task1Task2Task3

ABC

Memory

Task1 Task2A B C

Proc. Proc.1 Proc.2

Task3

Proc.3

Improves performance andpotentially reduces memory size

…

• Widely available in many DSP applications

… …

Memory Size in Modern Processors• Memory occupies much of the area in modern

processors — this reduces area available for the core and consumes large amounts of power

• Area and energy dissipation can be dramatically decreased with smaller memories

0%

20%

40%

60%

80%

100%

Are

a br

eakd

own other

mem

core

TI_C64x Itanium SPARC BlGe/L [ISSCC 02, 05]

Small Memory Requirements for DSP Tasks

• The memory required for common DSP tasks is quite small

• Several hundred words of memory are sufficient for many DSP tasks

Task IMem(words

)

DMem(words

)

N-pt FIR 6 2N8-pt DCT 40

16

8x8 2-D DCT 154 72

Conv. coding (k = 7)

29 14

Huffman encoder 200 330

N-pt convolution 29 2N64-pt complex FFT 97 192

Bubble sort 20 1

N merge sort 50 NSquare root 62 15

Exponential 108 32

GALS Clocking Style• The challenge of globally synchronous systems

– Design difficulty when using high clock frequencies, long clock wires caused by large chip sizes, and large circuit parameter variations

– High clock power consumption and lack of flexibility to independently control clock frequencies

• How about totally asynchronous design style– Lack of EDA tool support– Design complexity and circuit overhead

• The GALS compromise– Synchronous blocks each operating in their own

independent clock domain

Wires and On Chip Communication• Global wires are a concern

– Their length doesn’t shrink with technology scaling—assuming the chip size remains the same

– The ratio of global wire delay to gate delay doubles each generation

• Methods to avoid global wires– Network on chip (NOC)– Local communication– Nearest neighbor communication

nearest local

Outline


FIFO 032

words ALUMAC

Control

IMem 64

words

DMem128

words

OSC

Staticconfig

Dynamicconfig

Output

AsAP Block Diagram• GALS array of identical processors

– Each processor is a reduced complexity programmable DSP with small memories

– Each processor can receive data from any two neighbors and send data to any of its four neighbors

FIFO 132

words

IFetch

PC

Inst.Mem

Decode Mem.Read

SrcSelect

EXE 1 EXE 2 EXE 3 ResultSelect

WriteBack

FIFO 0 RD

FIFO 1 RD

DataMemRead

DCMemRead

De-code

Addr.Gens.

ALU DataMemWrite

DCMemWrite

Proc Output

Single Processor Architecture• 54 32-bit instructions, among which only

Bit-Reverse is algorithm specific• 9-stage pipeline

Bypass

Multiply Accumulator

×ACC

+

resethalt

clk

Programmable Clock Oscillator• Standard cell based• Configurable frequency

– Delay tunable stages using 7 parallel tri-state inverters

– 5 or 9 stage selection– 1 to 128 clock divider

• SR latch logic enables clean OSC halt

• Results– 1.66 MHz – 702 MHz– Max gap: 0.08 MHz

(1.66 – 500 MHz)

… … … …

… … … …

stage_sel

..

...

...

...

.

Clock divider

Write side Read sideeast

westsouth Write

logic

FIFO 0

Readlogic

east

north

west

south

CPU

FIFO 1Processor

east

northwestsouth

Inter-processor Communication

• Each processor contains two dual-clock FIFOs

• Rd/Wr in separate clock domains– Gray coded Rd/Wr

address across clock domains

• No FIFO failures in several weeks of testing with multiple procs

north

Binary Gray & Sync.

addr addr

SRAM

Advantages of theAsAP Clocking System

• Simplified clock tree design – The maximum span is < 1 mm in 0.18 µm

• Scalable – easy to add processors• Improved energy efficiency

– Clock halts in 9 cycles (processor dissipates leakage power only) and restores in < 1 cycle according to work availability

– 53% and 65% power savings for two applications– Independent clock and voltage scaling (individual

processor voltage scaling not implemented in this version)

Physical Design

• 0.18 µm TSMC• Standard cell• Design flow

– Completely synthesized, except oscillator– Macro memory blocks used for four main memories– Completely auto placed and routed

• To simplify physical design, clock gating not implemented in this version

Transistors: 1 Proc 230,000 Chip 8.5 millionMax speed: 475 MHz @ 1.8 VArea: 1 Proc 0.66 mm² Chip 32.1 mm²

Power (1 Proc @ 1.8V, 475 MHz): Typical application 32 mW Typical 100% active 84 mW Worst case 144 mWPower (1 Proc @ 0.9V, 116 MHz): Typical application 2.4 mW

SingleProcessor

OSC FIFOs

DMemIMem

810 µm5.65 mm

810 µm

5.68 mm

Chip Micrograph of the 6 x 6 Array

Outline


Area Evaluation

• Most of AsAP’s area is for the core (66%)• Each processor requires a very small area; more than

20x smaller than others

0%

20%

40%

60%

80%

100%

Are

a br

eakd

own

comm mem core

Pro

cess

or a

rea

(mm

²)

20x

All scaled to 0.13 µm[ISSCC 05, 00; ISCA04; CMPON96]

0.1

1

10

100 7230

8.3 7

0.34

29

TI RAW Fujitsu BlGe/ AsAP C64x 4-VLIW L

TI CELL/ Fujitsu RAW ARM AsAP C64x SPE 4-VLIW

Power and Performance

0.01

0.1

1

10 100 1000 10000

AsAPCELL/SPEFujitsu-4VLIWARMRAWTI C64x

Pow

er /

Clo

ck fr

eque

ncy

/ Sc

ale

* (m

W/M

Hz)

Peak performance density (MOPS/mm²)

All scaled to 0.13 µm

* Assume 2 ops/cycle for CELL/SPE and 3.3 ops/cycle for TI C64x

Note: word widths and workload not factored

[ISSCC 05, 00; ISCA04; CMPON96]

Higher performance

÷ area, and

Lower estimated

energy

DC inHuffm

DC inHuffm

Lv-shift1-DCT

ZigzagQuant.Zigzag

AC inHuffm

1-DCT

Transin DCT

8x8 DCT

Huffmanoutputinput

JPEG Core Encoder Implementation

• 9 processors• Fully functional on chip• 224 mW @ 300 MHz• ~1400 clock cycles for

each 8x8 block• Similar performance and

~11x lower energy dissipation than8-way VLIW TI C62x

AC inHuffm

[MICRO 02, ISCAS 02]

Pad

Scram

Conv. Code Punc Inter-

leave 1

Inter-leave 2

Mod.Map

PilotInsert

Train

IFFT BR

IFFT Mem

IFFT BF

IFFTOutput

GI/Wind.

GI/Wind.

IFFT Mem

IFFT BF

FIRFIR

IFFT BF

IFFT Mem

OutputSyncIFFT

Data bits

ToD/A

converter

802.11a/802.11g Wireless Transmitter Implementation

• 22 processors• Fully functional on chip• 407 mW @ 300 MHz

30% of 54 Mb/s• Code unscheduled and

lightly optimized• ~10x performance and

35x – 75x lower energy dissipation than 8-way VLIW TI C62x

[SIPS 04; ICC 02]

Summary• Scalable programmable processing array

– Many processors on a single chip– Reduced complexity processors with small memories– GALS clocking style– Nearest neighbor communication

• Results– 0.18 µm, 475 MHz @ 1.8 V

• 32 mW application power• 84 mW 100% active power• 2.4 mW application power @ 116 MHz and 0.9 V

– High performance density: 475 MOPS in 0.66 mm²– Well suited for future fabrication technologies

Acknowledgments• Funding

– Intel Corporation– UC Micro– NSF Grant No. 0430090– MOSIS– UCD Faculty Research Grant

• Special Thanks– R. Krishnamurthy, M. Anders, S. Mathew, S. Muroor,

W. Li, C. Chen, D. Truong

an asynchronous array of simple processors for dsp applications zhiyi yu, michael meeuwsen, ryan...

Documents