university of michigan electrical engineering and computer science 1 a scalable low-power...
Post on 21-Dec-2015
214 views
TRANSCRIPT
1 University of MichiganElectrical Engineering and Computer Science
A Scalable Low-power Architecture A Scalable Low-power Architecture For Software RadioFor Software Radio
Scott Mahlke
Collaboration between:University of Michigan, Arizona State University, and ARM Ltd.
2 University of MichiganElectrical Engineering and Computer Science
Anatomy of a Cellular PhoneAnatomy of a Cellular Phone
BluetoothDSP+ASICs
GPSDSP+ASICs
BasebandProcessor
GPP+DSP+ASIC
AnalogFrontend
ASICs
ApplicationProcessorGPP+DSP
PowerManager
Camera
Keyboard
Display
Speaker
Microphone
W-CDMA
3 University of MichiganElectrical Engineering and Computer Science
Software Defined Radio (SDR)Software Defined Radio (SDR)
• Use software routines instead of ASICs for the physical layer operations of wireless communication system
ASICs(PHY)ASICs(PHY)
ProgrammableHardware
ProgrammableHardware
SoftwareRoutinesSoftwareRoutines
• Rest of the talk– Characteristics of SDR algorithms– SODA architecture for power-efficient SDR– Compilation challenges and approach
4 University of MichiganElectrical Engineering and Computer Science
Advantages of SDRAdvantages of SDR• Lower costs
– Platform longevity, higher volume– SW has lower development costs
• Time to market– Future protocols will have complex
implementations– Overlap testing/development cycles
• Adaptability– Standards change over time– Multi-mode operation– Sharing hardware resources
UWB EDGE 802.16a
802.16a Bluetooth
802.11b WCDMA 802.11n
SDR
5 University of MichiganElectrical Engineering and Computer Science
Why is SDR Challenging?Why is SDR Challenging?
1
10
100
1000
0.1 1 10 100
Power (Watts)
Pe
ak
Pe
rfo
rma
nc
e (
Go
ps
)
Better
Pow
er Efficiency
10 Mops/m
W
100 Mops/m
W
1 Mops/m
W
GeneralPurpose
ProcessorsEmbeddedDSPs
Mobile SDRRequirements
Pentium MTI C6x
IBM CellHigh-end
DSPs
6 University of MichiganElectrical Engineering and Computer Science
The Anatomy of Wireless ProtocolsThe Anatomy of Wireless Protocols
1. Filtering: suppress signals outside frequency band
2. Modulation: map source information onto signal waveforms
3. Channel Estimation: Estimate channel condition for transceivers
4. Error Correction: correct errors induced by noisy channel
LPF-Tx scrambler spreader InterleaverChannelencoder
LPF-Rx
searcher
descrambler despreader combiner
descrambler despreader
...
deinteleaverChanneldecoder
(turbo/viterbi)
Upper layersTransmitter
Receiver
D/A
A/D
FrontendW-CDMA Physical Layer Processing
LPF-Tx
LPF-Rx
scrambler spreader
descrambler despreader
descrambler despreader
combiner
searcher
InterleaverChannelencoder
deinteleaverChanneldecoder
(turbo/viterbi)
7 University of MichiganElectrical Engineering and Computer Science
0
5000
10000
15000
20000
25000
30000
filtering modulation channelestimation
errorcorrection
[MO
PS
]
Idle State
Control Hold State
Active State
W-CDMA Workload ProfileW-CDMA Workload Profile
• One operation is equivalent to one RISC instruction• Searcher, Turbo decoder, and LPF are dominant workloads• Workload profile varies according to operation state
8 University of MichiganElectrical Engineering and Computer Science
SDR Kernel CharacteristicsSDR Kernel Characteristics
• 8 to 16-bit precision• Vector operations
– long vectors– constant vector size
• Static data movement patterns
• Scalar operations
Kernels Type of Computation
Vector Width
W-CDMA
Filter Vector 64
Modulation Vector 2560
Channel Est. Vector 320
Error Correction Mixed 8 or 256
802.11a
Filter Vector 33
Modulation (FFT) Vector 64
Channel Est. Mixed 16
Error Correction Mixed 64
9 University of MichiganElectrical Engineering and Computer Science
SODA System Architecture for 3GSODA System Architecture for 3G
• 4 PEs– static kernel mapping and
scheduling– SIMD+Scalar units
• 1 ARM GPP controller– scalar algorithms and
protocol controls
LocalMem
ExecutionUnit
PE
LocalMem
ExecutionUnit
PE
LocalMem
ExecutionUnit
PE
LocalMem
ExecutionUnit
PE
GlobalMemSystem ArchitectureARM
LocalMem
ExecutionUnit
PE
LocalMem
ExecutionUnit
PE
LocalMem
ExecutionUnit
PE
LocalMem
ExecutionUnit
PE
GlobalMemSystem ArchitectureARM
10 University of MichiganElectrical Engineering and Computer Science
LocalMem
ExecutionUnit
PE
LocalMem
ExecutionUnit
PE
LocalMem
ExecutionUnit
PE
LocalMem
ExecutionUnit
PE
GlobalMemSystem ArchitectureARM
• 2-Level scratchpad memories– 12KB Local scratchpad memory
for stream queues– 64KB global scratchpad memory
for large buffers
• Low-throughput shared bus– 200MHz 32-bit bus– inter-PE communication using
DMA
SODA System Architecture for 3GSODA System Architecture for 3G
11 University of MichiganElectrical Engineering and Computer Science
SODA PE ArchitectureSODA PE ArchitecturePE
Scalar pipeline
32x16bit
SSN
Vector to ScalarStage 1
SIMD Memory (8KB)
IR
RF ID
16bit EX
16bit WBALU
Scalar Memory (4KB)
32-waySIMD
IR
ScalarRF
RF ID
EX
WB
IR
AGURF
AGU ALU12bit
Inst.Mem.4KB
SIMD pipeline
AGU pipelineDMA16bit BUS
512bit
Vector to ScalarStage 2
Scalar to Vector
RF ID
16bit EX Multiplier16bit W
BALU
RF ID
16bit EX Multiplier16bit W
BALU
RF ID
16bit EX Multiplier16bit W
BALU
RF ID
16bit EX Multiplier16bit W
BALU
2 issue LIW (400MHz) - SIMD + (Scalar or AGU) DMA: - mem-to-mem transfer - access global memory
12 University of MichiganElectrical Engineering and Computer Science
SODA PE SIMD PipelineSODA PE SIMD PipelinePE
Scalar pipeline
32x16bit
SSN
Vector to ScalarStage 1
SIMD Memory (8KB)
IR
RF ID
16bit EX
16bit WBALU
Scalar Memory (4KB)
32-waySIMD
IR
ScalarRF
RF ID
EX
WB
IR
AGURF
AGU ALU12bit
Inst.Mem.4KB
SIMD pipeline
AGU pipelineDMA16bit BUS
512bit
Vector to ScalarStage 2
Scalar to Vector
RF ID
16bit EX Multiplier16bit W
BALU
RF ID
16bit EX Multiplier16bit W
BALU
RF ID
16bit EX Multiplier16bit W
BALU
RF ID
16bit EX Multiplier16bit W
BALU
16-bit 16 entries2 read/ 1 write port
RF
EX
16-bitMultiplier
40-bit ACC
16-bit
ALU
16bit
16bitWB
16bit
13 University of MichiganElectrical Engineering and Computer Science
SODA PE SIMD PipelineSODA PE SIMD PipelinePE
Scalar pipeline
32x16bit
SSN
Vector to ScalarStage 1
SIMD Memory (8KB)
IR
RF ID
16bit EX
16bit WBALU
Scalar Memory (4KB)
32-waySIMD
IR
ScalarRF
RF ID
EX
WB
IR
AGURF
AGU ALU12bit
Inst.Mem.4KB
SIMD pipeline
AGU pipelineDMA16bit BUS
512bit
Vector to ScalarStage 2
Scalar to Vector
RF ID
16bit EX Multiplier16bit W
BALU
RF ID
16bit EX Multiplier16bit W
BALU
RF ID
16bit EX Multiplier16bit W
BALU
RF ID
16bit EX Multiplier16bit W
BALU
SIMD: - 32 wide - predicated exec. - predicated neg.
Memory: - 512bit port - 1 read port - 1 write port - 8 KBytes
14 University of MichiganElectrical Engineering and Computer Science
SODA PE SIMD Shuffle NetworkSODA PE SIMD Shuffle NetworkPE
Scalar pipeline
32x16bit
SSN
Vector to ScalarStage 1
SIMD Memory (8KB)
IR
RF ID
16bit EX
16bit WBALU
Scalar Memory (4KB)
32-waySIMD
IR
ScalarRF
RF ID
EX
WB
IR
AGURF
AGU ALU12bit
Inst.Mem.4KB
SIMD pipeline
AGU pipelineDMA16bit BUS
512bit
Vector to ScalarStage 2
Scalar to Vector
RF ID
16bit EX Multiplier16bit W
BALU
RF ID
16bit EX Multiplier16bit W
BALU
RF ID
16bit EX Multiplier16bit W
BALU
RF ID
16bit EX Multiplier16bit W
BALU
SIMD Shuffle Network
• Shuffle exchange• Inverse shuffle exchange• Exchange only• Iterative feedback
15 University of MichiganElectrical Engineering and Computer Science
SODA PE Scalar PipelineSODA PE Scalar PipelinePE
Scalar pipeline
32x16bit
SSN
Vector to ScalarStage 1
SIMD Memory (8KB)
IR
RF ID
16bit EX
16bit WBALU
Scalar Memory (4KB)
32-waySIMD
IR
ScalarRF
RF ID
EX
WB
IR
AGURF
AGU ALU12bit
Inst.Mem.4KB
SIMD pipeline
AGU pipelineDMA16bit BUS
512bit
Vector to ScalarStage 2
Scalar to Vector
RF ID
16bit EX Multiplier16bit W
BALU
RF ID
16bit EX Multiplier16bit W
BALU
RF ID
16bit EX Multiplier16bit W
BALU
RF ID
16bit EX Multiplier16bit W
BALU
Scalar: - One 16-bit datapath - No mult unit Scalar memory: - 16bit port - 1 read/write port - 4 KBytes Scalar-to-Vector Vector-to-Scalar
16 University of MichiganElectrical Engineering and Computer Science
Power Consumption at 180nmPower Consumption at 180nm
0
200
400
600
800
1000
1200
1400
PE DataMemory
PE SIMDRF
PE SIMDALUs
PE SIMDPipeline
PE Others GlobalMemory
SystemOthers
Po
wer
(m
W)
in 1
80n
m
W-CDMA (2Mbps) 802.11a (24Mbps)
• 180nm ~ 3 W, 26.6 mm2
• 90nm (est) ~ 0.5 W, 6.7 mm2
17 University of MichiganElectrical Engineering and Computer Science
SDR Compilation StrategySDR Compilation Strategy• Two level application description
– System-level: Concurrent tasks extracted from “C + channels + attributes”
– Kernel-level: Data parallelism extracted from “C + vectors + Matlab operators”
• System compilation – Task level parallelism– Generates tasks, schedules, communication stubs, DMA
requests, timing assertions, synchronization, debug support• Kernel compilation – Data level parallelism
– Lower virtual DLP to physical implementation
18 University of MichiganElectrical Engineering and Computer Science
Stylized Automatic ParallelizationStylized Automatic Parallelization
void main() { for(int i=0; i<N; ++i) { int x = spread(w[i]); int y = scramb(x); fir(y); }}
void main() { for(int i=0; i<N; ++i) { int x = spread(w[i]) @P0; int y = scramb(x) @P1; fir(y) @P2; }}
1. Task assignment
void t1() { for(int i=0; i<N; ++i) { int x = spread(w[i])@P0; putchannel(X,x); }}void t2() { for(int i=0; i<N; ++i) { int y = scramb(getchan(X))@P1; putchannel(Y,y); }}void t3() { for(int i=0; i<N; ++i) { fir(getchannel(Y)) @ P2; }}main() { fork (t1); fork (t2); fork (t3);}2. Decompose tasks
19 University of MichiganElectrical Engineering and Computer Science
Kernel Level CompilationKernel Level Compilationtemplate<class T, TAPS, BSIZE>kernel FIR { vector<T, TAPS> z; vector<T, TAPS> coeff;
void set_coeff(vector<T, TAPS> c) { coeff = c; }
void run(channel<T, BSIZE> inbuf, channel<T, BSIZE> outbuf) { T in, out;
for (i = 0; i < BSIZE; i++) { in = inbuf.pop(); z += coeff * in; out = z[0]; outbuf.push(out); z = (z(1:TAPS-1),0); } }};
int main() { FIR<int8, 64, BUF_SIZE> fir65r; ...}
1. object declaration
// in = inbuf.pop();sld(sin, inbuf);agu_incr(inbuf, sizeof(int8));
// z += coeff * in;vdup(vin, sin);vmul(vt1, _fir65r_coeff_v0, vin);vmul(vt2, _fir65r_coeff_v1, vin);vadd(_fir65r_z_v0, _fir65_z_v0, vt1);vadd(_fir65r_z_v1, _fir65_z_v1, vt2);
2. operation translation
20 University of MichiganElectrical Engineering and Computer Science
Final ThoughtsFinal Thoughts• 2G and 3G SDR solutions are achievable in 90nm
– 3.9G 4-10x more performance with mW power consumption• Core technologies for future networks
– OFDM 64 – 2048 point FFT– MIMO – use of multiple antennas for transmission/reception– Low density parity check codes
• Key insight: SDR requires innovation across algorithm, software and hardware
• SDR platforms offer low-cost, longevity, and adaptability