university of michigan electrical engineering and computer science 1 a scalable low-power...

1 University of MichiganElectrical Engineering and Computer Science

A Scalable Low-power Architecture A Scalable Low-power Architecture For Software RadioFor Software Radio

Scott Mahlke

Collaboration between:University of Michigan, Arizona State University, and ARM Ltd.


Anatomy of a Cellular PhoneAnatomy of a Cellular Phone

BluetoothDSP+ASICs

GPSDSP+ASICs

BasebandProcessor

GPP+DSP+ASIC

AnalogFrontend

ASICs

ApplicationProcessorGPP+DSP

PowerManager

Camera

Keyboard

Display

Speaker

Microphone

W-CDMA


Software Defined Radio (SDR)Software Defined Radio (SDR)

• Use software routines instead of ASICs for the physical layer operations of wireless communication system

ASICs(PHY)ASICs(PHY)

ProgrammableHardware

ProgrammableHardware

SoftwareRoutinesSoftwareRoutines

• Rest of the talk– Characteristics of SDR algorithms– SODA architecture for power-efficient SDR– Compilation challenges and approach


Advantages of SDRAdvantages of SDR• Lower costs

– Platform longevity, higher volume– SW has lower development costs

• Time to market– Future protocols will have complex

implementations– Overlap testing/development cycles

• Adaptability– Standards change over time– Multi-mode operation– Sharing hardware resources

UWB EDGE 802.16a

802.16a Bluetooth

802.11b WCDMA 802.11n

SDR


Why is SDR Challenging?Why is SDR Challenging?

1

10

100

1000

0.1 1 10 100

Power (Watts)

Pe

ak

Pe

rfo

rma

nc

e (

Go

ps

)

Better

Pow

er Efficiency

10 Mops/m

W

100 Mops/m

W

1 Mops/m

W

GeneralPurpose

ProcessorsEmbeddedDSPs

Mobile SDRRequirements

Pentium MTI C6x

IBM CellHigh-end

DSPs


The Anatomy of Wireless ProtocolsThe Anatomy of Wireless Protocols

1. Filtering: suppress signals outside frequency band

2. Modulation: map source information onto signal waveforms

3. Channel Estimation: Estimate channel condition for transceivers

4. Error Correction: correct errors induced by noisy channel

LPF-Tx scrambler spreader InterleaverChannelencoder

LPF-Rx

searcher

descrambler despreader combiner

descrambler despreader

...

deinteleaverChanneldecoder

(turbo/viterbi)

Upper layersTransmitter

Receiver

D/A

A/D

FrontendW-CDMA Physical Layer Processing

LPF-Tx

LPF-Rx

scrambler spreader



combiner

searcher

InterleaverChannelencoder

deinteleaverChanneldecoder

(turbo/viterbi)


0

5000

10000

15000

20000

25000

30000

filtering modulation channelestimation

errorcorrection

[MO

PS

]

Idle State

Control Hold State

Active State

W-CDMA Workload ProfileW-CDMA Workload Profile

• One operation is equivalent to one RISC instruction• Searcher, Turbo decoder, and LPF are dominant workloads• Workload profile varies according to operation state


SDR Kernel CharacteristicsSDR Kernel Characteristics

• 8 to 16-bit precision• Vector operations

– long vectors– constant vector size

• Static data movement patterns

• Scalar operations

Kernels Type of Computation

Vector Width

W-CDMA

Filter Vector 64

Modulation Vector 2560

Channel Est. Vector 320

Error Correction Mixed 8 or 256

802.11a

Filter Vector 33

Modulation (FFT) Vector 64

Channel Est. Mixed 16

Error Correction Mixed 64


SODA System Architecture for 3GSODA System Architecture for 3G

• 4 PEs– static kernel mapping and

scheduling– SIMD+Scalar units

• 1 ARM GPP controller– scalar algorithms and

protocol controls

LocalMem

ExecutionUnit

PE

LocalMem

ExecutionUnit

PE

LocalMem

ExecutionUnit

PE

LocalMem

ExecutionUnit

PE

GlobalMemSystem ArchitectureARM

LocalMem

ExecutionUnit

PE

LocalMem

ExecutionUnit

PE

LocalMem

ExecutionUnit

PE

LocalMem

ExecutionUnit

PE



LocalMem

ExecutionUnit

PE

LocalMem

ExecutionUnit

PE

LocalMem

ExecutionUnit

PE

LocalMem

ExecutionUnit

PE


• 2-Level scratchpad memories– 12KB Local scratchpad memory

for stream queues– 64KB global scratchpad memory

for large buffers

• Low-throughput shared bus– 200MHz 32-bit bus– inter-PE communication using

DMA

SODA System Architecture for 3GSODA System Architecture for 3G


SODA PE ArchitectureSODA PE ArchitecturePE

Scalar pipeline

32x16bit

SSN

Vector to ScalarStage 1

SIMD Memory (8KB)

IR

RF ID

16bit EX

16bit WBALU

Scalar Memory (4KB)

32-waySIMD

IR

ScalarRF

RF ID

EX

WB

IR

AGURF

AGU ALU12bit

Inst.Mem.4KB

SIMD pipeline

AGU pipelineDMA16bit BUS

512bit


Scalar to Vector

RF ID

16bit EX Multiplier16bit W

BALU

RF ID


BALU

RF ID


BALU

RF ID


BALU

2 issue LIW (400MHz) - SIMD + (Scalar or AGU) DMA: - mem-to-mem transfer - access global memory


SODA PE SIMD PipelineSODA PE SIMD PipelinePE

Scalar pipeline

32x16bit

SSN


SIMD Memory (8KB)

IR

RF ID

16bit EX

16bit WBALU

Scalar Memory (4KB)

32-waySIMD

IR

ScalarRF

RF ID

EX

WB

IR

AGURF

AGU ALU12bit

Inst.Mem.4KB

SIMD pipeline


512bit


Scalar to Vector

RF ID


BALU

RF ID


BALU

RF ID


BALU

RF ID


BALU

16-bit 16 entries2 read/ 1 write port

RF

EX

16-bitMultiplier

40-bit ACC

16-bit

ALU

16bit

16bitWB

16bit


SODA PE SIMD PipelineSODA PE SIMD PipelinePE

Scalar pipeline

32x16bit

SSN


SIMD Memory (8KB)

IR

RF ID

16bit EX

16bit WBALU

Scalar Memory (4KB)

32-waySIMD

IR

ScalarRF

RF ID

EX

WB

IR

AGURF

AGU ALU12bit

Inst.Mem.4KB

SIMD pipeline


512bit


Scalar to Vector

RF ID


BALU

RF ID


BALU

RF ID


BALU

RF ID


BALU

SIMD: - 32 wide - predicated exec. - predicated neg.

Memory: - 512bit port - 1 read port - 1 write port - 8 KBytes


SODA PE SIMD Shuffle NetworkSODA PE SIMD Shuffle NetworkPE

Scalar pipeline

32x16bit

SSN


SIMD Memory (8KB)

IR

RF ID

16bit EX

16bit WBALU

Scalar Memory (4KB)

32-waySIMD

IR

ScalarRF

RF ID

EX

WB

IR

AGURF

AGU ALU12bit

Inst.Mem.4KB

SIMD pipeline


512bit


Scalar to Vector

RF ID


BALU

RF ID


BALU

RF ID


BALU

RF ID


BALU

SIMD Shuffle Network

• Shuffle exchange• Inverse shuffle exchange• Exchange only• Iterative feedback


SODA PE Scalar PipelineSODA PE Scalar PipelinePE

Scalar pipeline

32x16bit

SSN


SIMD Memory (8KB)

IR

RF ID

16bit EX

16bit WBALU

Scalar Memory (4KB)

32-waySIMD

IR

ScalarRF

RF ID

EX

WB

IR

AGURF

AGU ALU12bit

Inst.Mem.4KB

SIMD pipeline


512bit


Scalar to Vector

RF ID


BALU

RF ID


BALU

RF ID


BALU

RF ID


BALU

Scalar: - One 16-bit datapath - No mult unit Scalar memory: - 16bit port - 1 read/write port - 4 KBytes Scalar-to-Vector Vector-to-Scalar


Power Consumption at 180nmPower Consumption at 180nm

0

200

400

600

800

1000

1200

1400

PE DataMemory

PE SIMDRF

PE SIMDALUs

PE SIMDPipeline

PE Others GlobalMemory

SystemOthers

Po

wer

(m

W)

in 1

80n

m

W-CDMA (2Mbps) 802.11a (24Mbps)

• 180nm ~ 3 W, 26.6 mm2

• 90nm (est) ~ 0.5 W, 6.7 mm2


SDR Compilation StrategySDR Compilation Strategy• Two level application description

– System-level: Concurrent tasks extracted from “C + channels + attributes”

– Kernel-level: Data parallelism extracted from “C + vectors + Matlab operators”

• System compilation – Task level parallelism– Generates tasks, schedules, communication stubs, DMA

requests, timing assertions, synchronization, debug support• Kernel compilation – Data level parallelism

– Lower virtual DLP to physical implementation


Stylized Automatic ParallelizationStylized Automatic Parallelization

void main() { for(int i=0; i<N; ++i) { int x = spread(w[i]); int y = scramb(x); fir(y); }}

void main() { for(int i=0; i<N; ++i) { int x = spread(w[i]) @P0; int y = scramb(x) @P1; fir(y) @P2; }}

1. Task assignment

void t1() { for(int i=0; i<N; ++i) { int x = spread(w[i])@P0; putchannel(X,x); }}void t2() { for(int i=0; i<N; ++i) { int y = scramb(getchan(X))@P1; putchannel(Y,y); }}void t3() { for(int i=0; i<N; ++i) { fir(getchannel(Y)) @ P2; }}main() { fork (t1); fork (t2); fork (t3);}2. Decompose tasks


Kernel Level CompilationKernel Level Compilationtemplate<class T, TAPS, BSIZE>kernel FIR { vector<T, TAPS> z; vector<T, TAPS> coeff;

void set_coeff(vector<T, TAPS> c) { coeff = c; }

void run(channel<T, BSIZE> inbuf, channel<T, BSIZE> outbuf) { T in, out;

for (i = 0; i < BSIZE; i++) { in = inbuf.pop(); z += coeff * in; out = z[0]; outbuf.push(out); z = (z(1:TAPS-1),0); } }};

int main() { FIR<int8, 64, BUF_SIZE> fir65r; ...}

1. object declaration

// in = inbuf.pop();sld(sin, inbuf);agu_incr(inbuf, sizeof(int8));

// z += coeff * in;vdup(vin, sin);vmul(vt1, _fir65r_coeff_v0, vin);vmul(vt2, _fir65r_coeff_v1, vin);vadd(_fir65r_z_v0, _fir65_z_v0, vt1);vadd(_fir65r_z_v1, _fir65_z_v1, vt2);

2. operation translation


Final ThoughtsFinal Thoughts• 2G and 3G SDR solutions are achievable in 90nm

– 3.9G 4-10x more performance with mW power consumption• Core technologies for future networks

– OFDM 64 – 2048 point FFT– MIMO – use of multiple antennas for transmission/reception– Low density parity check codes

• Key insight: SDR requires innovation across algorithm, software and hardware

• SDR platforms offer low-cost, longevity, and adaptability

university of michigan electrical engineering and computer science 1 a scalable low-power...

Documents