DSP Architectural Considerations for Optimal Baseband Processing
Sridhar Rajagopal
Scott Rixner
Joseph R. Cavallaro
Behnaam Aazhang
Rice University, Houston, TX
This work has been supported by Nokia, TI, TATP and NSF
Wireless Communication Systems
Flexibility is required
Mobile
– Switch between standards
– Switch between parameters
Base-station
– Varying number of users
– Each user has different parameters
[Figure: wireless mobile device block diagram: RF unit, A/D and D/A converters, programmable baseband communications processor]
Integration of Cellular/Wireless LAN
W-CDMA base-station
– 4 Mbps
– Delay constraints
– Area constraints?
W-LAN base-station
– 100 Mbps
– Delay constraints
– Area constraints?
Mobile
– W-CDMA & W-LAN
– 1 Mbps & 100 Mbps/# of users
– Delay, area, and power constraints!
Computation Requirements
Estimation, Detection and Decoding in a 4Mbps W-CDMA cellular multiuser system
[Figure: ALUs required for real-time operation at 500 MHz (log scale, 10^0 to 10^3) versus number of W-CDMA cellular users (0 to 300); curves for add and multiply under slow fading (estimation every 1000 bits), medium fading (estimation every 100 bits), and fast fading (estimation every 10 bits); annotation: data rates per user]
Proposed DSP System Evolution
Current solutions to meet real-time: racks of DSPs
Proposed: programmable DSP processor for 4G wireless systems, < x cm by < x cm
Future wireless DSP architectures
– x = 2.5 (W-CDMA BS)
– x = 2.0 (W-LAN BS)
– x = 1.5 (Mobile Handset)
The System Design Challenge
Current single processor DSPs not powerful enough for next generation multi-standard applications
Algorithms well understood at data-flow level
Can design real-time systems in fixed VLSI
Pushing towards programmable implementation
Stream processors provide an interesting alternative
Research Contributions
Algorithms for future wireless communications
– Multiuser channel estimation and detection
– Task partitioning, parallelism, pipelining
– Used DSPs to develop and understand algorithms
Special-purpose implementations
– VLSI and FPGA mappings of algorithms
– Conventional and on-line arithmetic
Flexible implementations (current work)
– Future DSP architectures?
– Stream processors?
– Architectural innovation
– Functional unit design and usage
Outline
Motivation
Parallel Algorithms for Estimation, Detection, and Decoding
Stream Processor Architecture
Performance Comparisons and Results
Typical Base-station Algorithms
Wireless LAN
– Equalization?
– FFT
– Viterbi decoding
W-CDMA
– Multiuser channel estimation
– Multiuser detection
– Viterbi decoding
Advanced receiver schemes
– Turbo decoding
– Multiple antenna systems
Parallel W-CDMA Estimation/Detection/Decoding
Multiuser estimation
– Replaced matrix inversion by gradient descent
Multiuser detection
– Parallel Interference Cancellation (PIC)
– Pipelined algorithm that avoids block-based detection
Viterbi decoding
– Trellis structures suited for decoding
– Register exchange for survivor memory
– No traceback latency
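A minimal sketch of the idea of replacing a matrix inversion by gradient descent: instead of computing B R^{-1}, iterate A <- A - mu (A R - B) until A R is close to B. The function names, the step size mu, the iteration count, and the 2x2 test matrices are illustrative assumptions, not values from the slides.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def gradient_descent_solve(R, B, mu=0.2, iterations=200):
    """Iteratively find A with A R ~= B without inverting R.

    Assumes R is symmetric positive definite and that
    0 < mu < 2 / (largest eigenvalue of R), so the iteration converges.
    """
    rows, cols = len(B), len(B[0])
    A = [[0.0] * cols for _ in range(rows)]
    for _ in range(iterations):
        AR = matmul(A, R)
        # Step against the residual (A R - B).
        A = [[A[i][j] - mu * (AR[i][j] - B[i][j]) for j in range(cols)]
             for i in range(rows)]
    return A


# Hypothetical test problem: build B from a known A so the answer can be checked.
R = [[2.0, 1.0], [1.0, 3.0]]
A_true = [[1.0, -1.0], [0.5, 2.0]]
B = matmul(A_true, R)

A_est = gradient_descent_solve(R, B)
max_err = max(abs(A_est[i][j] - A_true[i][j])
              for i in range(2) for j in range(2))
print(max_err)
```

Each iteration needs only multiplies and adds, which is the property that makes this style of update attractive on hardware with many adders and multipliers and a single divider.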
Estimation/Detection (64,32 sizes)
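Multiuser Estimation
$R_{bb} = R_{bb} - b_0 b_0^T + b_L b_L^T$
$R_{br} = R_{br} - b_0 r_0^H + b_L r_L^H$
$A = A - \mu (A R_{bb} - R_{br})$
Prepare Matrices for Detection
$L = \mathrm{Re}[A_0^H A_1]$
$C = \mathrm{Re}[A_0^H A_0 + A_1^H A_1 - \mathrm{diag}(A_0^H A_0 + A_1^H A_1)]$
$R = L^H$
Multiuser Detection
$y_i = \mathrm{Re}[A_0^H x_i + A_1^H x_{i+1}]$
$d_i = \mathrm{sign}(y_i)$
$y_i = y_i - L x_{i-1} - C x_i - R x_{i+1}$
$d_i = \mathrm{sign}(y_i)$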
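A simplified sketch of parallel interference cancellation. It assumes a synchronous, single-bit-window model in which the matched filter output is y0 = G b, with G a hypothetical real-valued matrix combining user cross-correlations and amplitudes. This is not the pipelined asynchronous detector of the slides; all names and numbers are illustrative.

```python
def sign(v):
    """Hard decision: +1 for non-negative values, -1 otherwise."""
    return 1 if v >= 0 else -1


def pic_detect(G, y0, stages=3):
    """Return the bit decisions after each PIC stage (stage 0 = matched filter).

    In every stage, each user subtracts the interference estimated from the
    other users' previous-stage decisions; all users update in parallel.
    """
    n = len(y0)
    d = [sign(v) for v in y0]
    history = [d]
    for _ in range(stages):
        y = [y0[i] - sum(G[i][j] * d[j] for j in range(n) if j != i)
             for i in range(n)]
        d = [sign(v) for v in y]
        history.append(d)
    return history


# Hypothetical 3-user example with a weak first user (near-far situation).
H = [[1.0, 0.4, 0.2],
     [0.4, 1.0, 0.3],
     [0.2, 0.3, 1.0]]                  # assumed cross-correlations
amplitudes = [1.0, 4.0, 2.0]           # assumed received amplitudes
G = [[H[i][j] * amplitudes[j] for j in range(3)] for i in range(3)]
bits = [1, -1, 1]
y0 = [sum(G[i][j] * bits[j] for j in range(3)) for i in range(3)]

history = pic_detect(G, y0)
print(history[0])    # matched filter decisions: the weak user is wrong
print(history[-1])   # decisions after cancellation
```

In this example the matched filter alone misdetects the weakest user, and one cancellation stage is enough to recover the transmitted bits.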
[Figure: Viterbi trellis for a rate 1/2 code with K = 5, states X(0) to X(15), drawn in three arrangements: a. unsuitable trellis, b. suitable trellis, c. shuffled suitable trellis]
Survivor Management in Viterbi Decoding
Two techniques
– Traceback (commonly used)
– Register exchange
Traceback is simpler
– Less area in VLSI architectures
– Drawback: sequential and additional latency
Register exchange is faster
– Parallel updates
– Packing decoded bits in the register needs to access the entire register
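A minimal sketch of register-exchange survivor management. It assumes a small rate 1/2, K = 3 code with generators 7 and 5 (octal) and hard-decision Hamming metrics to keep the example short, rather than the K = 5 code in the slides; the function names are hypothetical. Each state carries the full decoded-bit register of its survivor path, so no traceback pass is needed at the end.

```python
def parity(v):
    """Parity (XOR of all bits) of a non-negative integer."""
    return bin(v).count("1") & 1


def encode(bits, K=3, gens=(0b111, 0b101)):
    """Rate 1/2 convolutional encoder starting in state 0."""
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state       # newest bit on top of the state
        out.append(tuple(parity(reg & g) for g in gens))
        state = reg >> 1                   # drop the oldest bit
    return out


def decode_register_exchange(symbols, K=3, gens=(0b111, 0b101)):
    """Hard-decision Viterbi decoding with register-exchange survivors."""
    n_states = 1 << (K - 1)
    INF = float("inf")
    metric = [0] + [INF] * (n_states - 1)  # encoder starts in state 0
    regs = [[] for _ in range(n_states)]   # one decoded-bit register per state
    for rx in symbols:
        new_metric = [INF] * n_states
        new_regs = [None] * n_states
        for state in range(n_states):
            if metric[state] == INF:
                continue                   # state not reachable yet
            for b in (0, 1):
                reg = (b << (K - 1)) | state
                expected = tuple(parity(reg & g) for g in gens)
                m = metric[state] + sum(e != r for e, r in zip(expected, rx))
                nxt = reg >> 1
                if m < new_metric[nxt]:
                    new_metric[nxt] = m
                    # Register exchange: copy the survivor's register and
                    # append the newly decided bit.
                    new_regs[nxt] = regs[state] + [b]
        metric, regs = new_metric, new_regs
    best = min(range(n_states), key=metric.__getitem__)
    return regs[best]                      # decoded bits, no traceback needed


message = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]   # last two zeros flush the encoder
clean = encode(message)
noisy = list(clean)
noisy[3] = (1 - noisy[3][0], noisy[3][1])  # flip one received bit

print(decode_register_exchange(clean) == message)
print(decode_register_exchange(noisy) == message)
```

All state registers are updated independently within a step, which is the parallelism the slide refers to; the cost is that every update copies an entire register.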
DSP Evolution and Trends
DSP Architectures
– Increased parallelism and computational throughput
– TI TMS320C6x generation of VLIW DSPs
Media Processing Architectures
– Orders of magnitude increase in parallelism and computational throughput
– 3D graphics!
– Imagine processor developed at MIT/Stanford
  • Prototype fabricated and licensed by TI
  • Flexible and extensible VLIW multiple cluster architecture
– Applicable to wireless communications?
The Imagine Architecture
[Figure: Imagine stream processor block diagram: host processor, stream controller, network interface to the network, microcontroller, stream register file, ALU clusters 0 to 7, and a streaming memory system with four SDRAM channels]
Arithmetic Clusters
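VLIW control
3 adders, 2 multipliers, 1 divider
Scratch-pad and communication unit
Distributed register files
[Figure: arithmetic cluster: three adders (+), two multipliers (*), one divider (/), and a communication unit (CU) joined by a cross point to local register files, with connections from the SRF, to the SRF, and to the intercluster network]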
Bandwidth Hierarchy
41.2 32-bit operations per word of memory bandwidth
[Figure: bandwidth hierarchy: SDRAM to stream register file at 2 GB/s, stream register file to ALU clusters at 32 GB/s, 544 GB/s within the ALU clusters]
Stream Programming
StreamC
– Executes on host processor
– C++
– Controls stream transfers between main memory and SRF

    void main() {
        Stream<int> a(256);
        Stream<int> b(256);
        Stream<int> c(256);
        Stream<int> d(1024);
        ...
        example1(a, b, c);
        example2(c, d);
        ...
    }

KernelC
– Executes on clusters
– C-like syntax
– Kernel computation
– Compiled by iscd

    KERNEL example1(istream<int> a,
                    istream<int> b,
                    ostream<int> c) {
        loop_stream(a) {
            int ai, bi, ci;
            a >> ai;
            b >> bi;
            ci = ai * 2 + bi * 3;
            c << ci;
        }
    }
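For readers without the Imagine tools, the behavior of the example1 kernel can be mimicked in ordinary Python. This is only an emulation of the arithmetic, assuming streams behave like equal-length sequences; the interleaved assignment of elements to 8 clusters is an illustrative assumption about how a SIMD machine might split the work, not a statement about the actual scheduler.

```python
NUM_CLUSTERS = 8  # the slides describe 8 ALU clusters


def example1(a, b):
    """Emulate the kernel body: c[i] = a[i] * 2 + b[i] * 3 for every element."""
    return [ai * 2 + bi * 3 for ai, bi in zip(a, b)]


def split_across_clusters(stream, clusters=NUM_CLUSTERS):
    """Hypothetical interleaving: element i is handled by cluster i % clusters."""
    return [stream[k::clusters] for k in range(clusters)]


a = list(range(16))
b = [1] * 16
c = example1(a, b)

print(c[:4])
print(split_across_clusters(c)[0])
```

The point of the kernel model is that the same short loop body is applied to every stream element, so the work divides evenly across clusters with no data-dependent control flow.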
1024-point FFT Performance
Processor   Frequency   Data             Radix   Time     Power    Energy
Imagine     500 MHz     Float (32-bit)   2       7.4 µs   3.8 W    28 µJ
C6711       150 MHz     Float (32-bit)   2       138 µs   ~1.3 W
C6411       300 MHz     Fixed (16-bit)   mixed   21 µs    ~0.5 W
C6201       100 MHz     Fixed (16-bit)   2       227 µs   0.6 W    144 µJ
Virtex II   125 MHz     Fixed (16-bit)   4       2 µs
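The energy column follows from power multiplied by time. A small check for the Imagine row, assuming the time is in microseconds and the energy in microjoules (the unit prefixes are an assumption, since a 1024-point FFT at 500 MHz cannot plausibly take seconds); the helper name is hypothetical.

```python
def energy_uj(power_w, time_us):
    """Energy in microjoules from power in watts and time in microseconds."""
    return power_w * time_us


imagine_energy = energy_uj(3.8, 7.4)
print(f"{imagine_energy:.1f} uJ")  # 28.1 uJ, matching the listed value of 28
```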
Media vs. Communications
Similarities
– Data parallelism
– Low-precision data
– High computation rates
Different characteristics of communications processing
– More data reorganization, such as matrix transposes
– Bit-level operations
Explore space of stream processor architectures with isim
– Cycle-accurate stream processor simulation
– Flexible machine description language (read by both simulator and compiler)
– Vary number and design of functional units
– Vary memory, register sizes
– etc.
Stream Data Flow
[Figure: stream data flow through multiuser channel estimation, multiuser detection, and decoding. Kernels shown: correlation update, matrix mult, iteration update, matrix transpose, matrix mul L, matrix mul C, matched filter, PIC, data rearrangement, Viterbi. Estimation bits and detection bits pass through a buffer; the legend distinguishes computation from communication]
Matrix Multiplication Kernel (Imagine)
32 cycle loop
Executed on all 8 clusters
Complexity
– O(N^3) multiplies
– O(N^3) adds
100% multiplier utilization in the loop
Divider is unnecessary!
[Figure: inner-loop instruction schedule across ADD0, ADD1, ADD2, MUL0, MUL1, DIV0; shading marks communication (waiting for input) and FU unavailable (input ready but FU busy)]
Replace Divider with Multiplier
22 cycle loop
Executed on all 8 clusters
97% multiplier utilization in the loop
85% adder utilization in the loop
Changing functional units
– Supported by simulator/compiler
– Architecturally realistic
[Figure: instruction schedule across ADD0, ADD1, ADD2, MUL0, MUL1, MUL2]
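The two loop lengths are consistent with each other: the same number of multiplies per loop iteration is issued in both configurations. A quick check using only the figures quoted on the slides (the helper name is hypothetical):

```python
def useful_ops(cycles, units, utilization):
    """Operations issued per loop iteration on one cluster."""
    return cycles * units * utilization


mults_with_divider = useful_ops(32, 2, 1.00)   # 2 multipliers, 32 cycle loop
mults_with_third = useful_ops(22, 3, 0.97)     # 3 multipliers, 22 cycle loop
speedup = 32 / 22

print(mults_with_divider, round(mults_with_third, 2), round(speedup, 2))
```

Both configurations issue about 64 multiplies per iteration, so the roughly 1.45x shorter loop comes from the third multiplier rather than from a change in the amount of work.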
Kernel Computational Time
Algorithm       Kernel   FU utilization (3 +, 2 *)   Execution time (cycles)   FU utilization (3 +, 3 *)   Execution time (cycles)
Estimate        1        70%, 100%                   1224                      78.6%, 78%                  1064
                2        53%, 91%                    22720                     85%, 99%                    14360
                3        55%, 42%                    1058                      55%, 28%                    1058
                Total                                                                                      14464
Glue Matrices   4        59%, 91%                    7468                      78%, 84%                    5573
                5        63%, 96%                    12192                     68%, 71%                    11084
                Total                                                                                      16657
Detect          6        67%, 100%                   366                       90%, 90%                    275
                7        67%, 96%                    996                       89%, 84%                    760
                Total                                                                                      1035
Decode          8        13%, 2%                     8044                      13%, 1.4%                   8044
Estimation and Detection Execution
[Figure: kernel execution and memory transfers versus cycle for estimation and detection (10 bits), with intervals stalled waiting for data from memory]
Viterbi Execution
[Figure: kernel execution and memory transfers versus cycle for initialization and decode (32 bits)]
Real-time Performance
[Figure: clock cycles (0 to 2.5 x 10^4) spent in estimation, detection, decoding, and stall time under slow, medium, and fast fading, compared with the real-time requirement at 500 MHz]
Rough DSP Comparison
[Figure: execution time (log scale, 10^-7 to 10^-1) for kernels 1; 2,3; 4,5; 6; 7, grouped as estimation, glue matrices, and detection, comparing Imagine, TI C67 with internal memory, and TI C67 with external memory]
Future Work
Achieve real-time rates
– Additional functional units (that can be used efficiently!)
– Eliminate communication stalls between kernels
– Support for matrix transposes and bit-level operations
Power and area constraints
– Low power stream processing
– Scaling the architecture for handsets
Scalability with data rates
– Boundaries of the architecture
Handset algorithms
Conclusions
Future wireless communications algorithms
– exceed the capabilities of current DSPs
– require flexibility to change algorithms and parameters
– require efficient use of resources because of delay, area, and power constraints
Architectural developments are needed for future DSPs
– Stream processing is a promising approach
– Additional hardware acceleration, akin to Viterbi coprocessor on C64?
The insights gained from our designs can be applied to DSPs and other processors with constraints on delay, area and power.