
Elec 599: Performance Evaluation of Channel Estimation Algorithms for W-CDMA on Digital Signal Processors

Sridhar Rajagopal
ECE Department
Rice University

[email protected]

Abstract

This report evaluates the performance of maximum likelihood based channel estimation algorithms for W-CDMA on Digital Signal Processors (DSPs). We use the TI TMS320C6701 floating point Evaluation Module (EVM) for this purpose, with the aim of meeting the real-time constraints of the channel estimation algorithms. We exploit the various features available on the DSP through intrinsic operators and assembly subroutines to improve the execution time. The parameter extraction algorithm takes 51 ms for channel estimation with a 150 bit preamble and a Gold code of length 31 for 15 users. A comparison with General Purpose Processors (GPPs) for this algorithm shows that the GPPs can obtain a 4.49X improvement in performance. The joint estimation and detection algorithm gives better performance than the parameter extraction method (2.92X for 15 users on the DSP).

Finally, we look at the limitations of the available processor technology and suggest improvements in architecture and compilers to help meet the real-time performance requirements of algorithms for wireless communications.

1 Introduction

Wideband Direct-Sequence Code Division Multiple Access (W-CDMA) is the emerging protocol [1] for the next generation (3G) wireless communication systems. W-CDMA has been designed to add features such as multimedia capabilities, high data rates and multi-rate services to the existing wireless communication framework. The data rates proposed [3, 10] are 2 Mbps indoor, 384 kbps pedestrian, and 144 kbps vehicular. The International Telecommunication Union (ITU) has developed the IMT-2000 (International Mobile Communications by 2000) standards for the third generation systems. To meet IMT-2000, several standards have been proposed and developed by different industrial committees in regions such as the U.S., Europe and Japan. All these standards have accepted CDMA in one form or another as the multiple access method for wireless communications.

One of the main bottlenecks in wireless communications, both in terms of accuracy and speed, is the estimation of channel parameters. Channel parameter estimation [11] is essential for finding the delays and fading amplitudes in the system, which are used by the detector to detect the incoming data. We look at this channel estimation problem as a representative of the algorithms used for wireless communications. Currently, channel estimation processing [7] is done both on the mobile handset and

the base station using DSPs or Application-Specific ICs (ASICs) or a combination of both. Before these algorithms can be integrated with the next-generation CDMA technology, they need to be tested and implemented on such a real system to evaluate their performance. This is the main motivation behind our performance evaluation of the channel estimation algorithms on DSPs.

The channel estimation algorithms have stringent time, power and size constraints, and they put more pressure on the existing resources and the available processor technology. As our analysis shows, more advances are needed in the current generation of microprocessors and DSPs to satisfy the real-time requirements of W-CDMA.

The organization of this report is as follows. In Section 2, we introduce the channel estimation problem. We show two different methods of maximum-likelihood based channel estimation along with their advantages and disadvantages. In Section 3, we describe the simulation methodology used for our performance evaluation. We also discuss the DSP and GPP hardware used in our simulations. Section 4 shows results obtained for the parameter extraction method on DSPs. It also contains a comparative study with GPPs for the parameter extraction method. In Section 5, we show the results obtained by the joint estimation and detection scheme. Section 6 shows the pitfalls of the current architectures and compiler technology and suggests improvements possible in order to meet the real-time performance constraints. We also discuss the conclusions and future directions in architectures for wireless communications in Section 7.

2 Channel Estimation

The channel is the wireless interface between the mobile user and the base station. Many undesirable effects such as interference from other users, delays from multiple paths, fading and noise act on the signal as it passes through the channel. The detector needs some synchronization information in order to correctly detect the incoming bit sequence. Hence the parameters of the channel need to be estimated for proper detection. Channel estimation involves estimating and tracking the delays of each user's bits and the channel attenuation over the different paths. One of the popular methods for channel estimation is the Maximum Likelihood algorithm [13], due to its ease of computation and high performance. This algorithm is designed to handle time variations in the system, multiple propagation paths, and a large number of users with varying levels of transmit power.

2.1 System Model

The transmission is usually done using frames of data which have an initial preamble (a sequence of bits known at the receiver). As the signal passes through the channel, the effects of the channel cause changes in the preamble. By comparing the received bits with the known preamble, the properties of the channel (delays and amplitudes) can be extracted from the signal. The channel parameters are assumed to remain static until the end of the frame. We have not incorporated tracking in our algorithm, which could eliminate the need for a new preamble in the next frame. The channel estimation algorithm discussed here considers multipath and multiuser effects but assumes a static channel (for the preamble duration) and a single sensor at the receiver. It also does not include Doppler effects. We assume a short spreading sequence system for our algorithms.


2.2 Parameter Extraction Based Estimation

The channel parameters are embedded in the received signal. Since the observation vector r_i depends on the channel, whose a priori statistics are unknown, a maximum likelihood estimate of the channel is often used. In our problem, r_i is a function of the channel vectors z_k, the noise covariance matrix K (assumed unknown), and the transmitted bits b_i. We assume that the bits b_i are known since, otherwise, the maximization of the likelihood function is ill-posed. This is accomplished in the acquisition phase by requiring that all users transmit training sequences.

The following steps occur in the parameter extraction method:

R_{rr} = \frac{1}{L} \sum_{i=1}^{L} r_i r_i^H

R_{br} = \frac{1}{L} \sum_{i=1}^{L} b_i r_i^H

R_{bb} = \frac{1}{L} \sum_{i=1}^{L} b_i b_i^H

Y = R_{br}^H R_{bb}^{-1}

K = R_{rr} - Y R_{br}

z_k^H = \left( y_{2k-1}^H K^{-1} U_k^R + y_{2k}^H K^{-1} U_k^L \right)
        \left( U_k^{R\prime} K^{-1} U_k^R + U_k^{L\prime} K^{-1} U_k^L \right)^{-1}

where R_{rr} is the autocorrelation of the observation vector, R_{br} the cross-correlation between the observation vector and the preamble bits, R_{bb} the autocorrelation of the preamble bits, K the noise covariance matrix, Y the estimate of UZ, U the matrix of codes, and Z the channel impulse response matrix.
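As a concrete illustration of the correlation step, the following is a minimal C sketch of how R_{bb} and R_{br} can be accumulated over the L preamble bit vectors. It is not the report's DSP code: the dimension names KB (length of each b_i) and NR (length of each r_i) and the use of C99 complex arithmetic are assumptions made here for readability.

    /* Minimal sketch (not the report's actual DSP code): accumulate the sample
       correlation matrices R_bb and R_br over the L preamble bit vectors, as in
       the equations above.  KB is the length of each bit vector b_i and NR the
       length of each observation vector r_i. */
    #include <complex.h>
    #include <stddef.h>

    void accumulate_correlations(size_t L, size_t KB, size_t NR,
                                 const double complex b[L][KB],  /* preamble bits */
                                 const double complex r[L][NR],  /* observations  */
                                 double complex Rbb[KB][KB],
                                 double complex Rbr[KB][NR])
    {
        /* zero the accumulators */
        for (size_t p = 0; p < KB; p++) {
            for (size_t q = 0; q < KB; q++) Rbb[p][q] = 0;
            for (size_t n = 0; n < NR; n++) Rbr[p][n] = 0;
        }
        /* R_bb += b_i b_i^H and R_br += b_i r_i^H for each preamble bit index i */
        for (size_t i = 0; i < L; i++) {
            for (size_t p = 0; p < KB; p++) {
                for (size_t q = 0; q < KB; q++) Rbb[p][q] += b[i][p] * conj(b[i][q]);
                for (size_t n = 0; n < NR; n++) Rbr[p][n] += b[i][p] * conj(r[i][n]);
            }
        }
        /* scale by 1/L to form the averages */
        for (size_t p = 0; p < KB; p++) {
            for (size_t q = 0; q < KB; q++) Rbb[p][q] /= (double)L;
            for (size_t n = 0; n < NR; n++) Rbr[p][n] /= (double)L;
        }
    }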

A least squares fit of z_k is performed to extract the strongest P paths. For each pair of adjacent coefficients of z_k, we obtain local values of the amplitudes and delays from the following optimization:

[w_q, \epsilon_q] = \arg\min_{w,\,\epsilon} \; \| z_{k,q} - (1 - \epsilon) w \|^2 + \| z_{k,q+1} - \epsilon w \|^2

We then search for the global maximum to obtain the strongest path:

\hat{q} = \arg\max_q |w_q|

\hat{\tau} = (\hat{q} + \epsilon_{\hat{q}}) T_c, \qquad \hat{w} = w_{\hat{q}}

where \epsilon_{\hat{q}} is the fractional part of the delay, \hat{q} is the integer part of the delay, \hat{\tau} is the estimated delay, and \hat{w} is the estimated amplitude.

The estimated path is subtracted from z_k and the process is repeated to find the next strongest path, until a specified number of paths have been identified. For details, refer to [13].


2.3 Complexity for Uplink and Downlink

For the uplink case, since the noise of all the users is uncorrelated, we can assume that the noise is white. Under this assumption, the covariance matrix K reduces to an identity matrix and may therefore be omitted from the calculations. We also no longer need to evaluate R_{rr}, as it is used only for calculating the covariance matrix. For the downlink case, since the receiver is a mobile phone, the number of users is one, and the complexity is therefore independent of the number of users in the system. Table 1 (from [13]) shows the complexity of the various operations for the uplink and downlink.

Table 1. Complexity of the maximum likelihood algorithm

Processing                        | Frequency of execution | Uplink          | Downlink
Estimate correlation matrices     | per bit of preamble    | O(MNK), O(K^2)  | O(MN)
Estimate Y                        | once per preamble      | O(MNK^2)        | O(MN)
Estimate z_k                      | once per preamble      | O(M^2 N^2 K)    | O(M^2 N^2)
Extract parameters                | once per preamble      | O(MNK)          | O(MN)
Estimate colored noise covariance | once per frame         | not required    | O(M^2 N^2 L) + O(M^3 N^3)
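Under the white-noise uplink assumption described above, the channel estimate of Section 2.2 simplifies to the form actually needed at the base station. This is only a restatement of the earlier expression with K set to the identity, shown here for clarity rather than an additional result from [13]:

z_k^H = \left( y_{2k-1}^H U_k^R + y_{2k}^H U_k^L \right)
        \left( U_k^{R\prime} U_k^R + U_k^{L\prime} U_k^L \right)^{-1}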

2.4 Joint Estimation and Detection

In the joint estimation and detection method, we do not need to extract the parameters from the channel impulse response matrix. Instead, the matrix of the spreading codes and the unknown parameters is passed directly to the detector.

Uz^R = even columns of Y
Uz^L = odd columns of Y

The UZ matrix can be written as

UZ = \begin{bmatrix}
Uz^R & Uz^L & 0      & \cdots & 0 \\
0    & Uz^R & Uz^L   & \cdots & 0 \\
\vdots &    & \ddots & \ddots & \vdots \\
0    & 0    & \cdots & Uz^R   & Uz^L \\
0    & 0    & \cdots & 0      & Uz^R
\end{bmatrix}

The size of the UZ matrix depends on the size of the detection window used for the multishot detection. However, we notice that the same matrix is repeated D times, where D is the size of the detection window.


Figure 1. Access Slots in Slotted ALOHA (access slots offset from the frame boundary and spaced 1.25 ms apart, each carrying a random access burst)

We can see that since the same matrix is repeated D times, we can change the detector to take its inputs as a single matrix and thus save on memory. Hence, it is enough to produce the Y matrix to be given to the detector. Since calculating the Y matrix is just a part of the parameter extraction method, we can expect great savings in computation in the joint estimation and detection method. Also, since the parameters are not extracted, the errors due to the extraction of parameters are avoided. For details, refer to [13].
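To make the memory saving concrete, the detector can index into Y directly instead of materializing the D-fold repeated block matrix. The sketch below is a hypothetical helper, not code from the report: the names, the N x K block dimensions, and the handling of boundary blocks are assumptions, with Uz^R on the diagonal block positions and Uz^L on the superdiagonal, as shown above.

    /* Minimal sketch (hypothetical helper): return element (row, col) of the
       implicit block-banded UZ matrix without building it.  UzR and UzL are the
       even and odd columns of Y (each N x K). */
    #include <complex.h>
    #include <stddef.h>

    double complex uz_element(size_t N, size_t K,
                              const double complex UzR[N][K],
                              const double complex UzL[N][K],
                              size_t row, size_t col)
    {
        size_t block_row = row / N, r = row % N;   /* block row and offset    */
        size_t block_col = col / K, c = col % K;   /* block column and offset */

        if (block_col == block_row)     return UzR[r][c]; /* diagonal block      */
        if (block_col == block_row + 1) return UzL[r][c]; /* superdiagonal block */
        return 0.0;                                       /* all other blocks    */
    }

With this approach only Y itself is stored, independent of the detection window size D.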

2.5 Real-time Requirements of W-CDMA

The physical layer of the radio interface consists of different types of control channels. The random access transport channel is the only common transport channel type in the uplink. The W-CDMA standards [4, 5] specify a physical random access channel based on a slotted ALOHA approach, i.e., a mobile station can start transmission at a number of well-defined time slots relative to the frame boundary. The access slots are spaced 1.25 ms apart, as shown in Figure 1. The random access burst consists of two parts: a 1 ms preamble part followed by a 10 ms message part. Between the preamble and the message there is an idle time of 0.25 ms, which allows for the detection of the preamble sequence and the subsequent on-line processing of the message part. This is shown in Figure 2.

Figure 2. Structure of the Random Access Burst (1 ms preamble part, 0.25 ms idle time, 10 ms message part)


Figure 3. Message Part of the Random Access Burst (data on the I channel; pilot and rate information on the Q channel; 10 ms duration)

The preamble part of a random access burst consists of a signature of length 16 complex symbols (±1 ± j). Each preamble is spread with a 256-chip real orthogonal Gold code. There are a total of 16 different signatures based on the orthogonal Gold code set of length 16. The chip rate is 4.096 Mchips/s.
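As a quick arithmetic consistency check on these numbers (nothing beyond what is stated above):

16 \text{ symbols} \times 256 \text{ chips/symbol} = 4096 \text{ chips}, \qquad
\frac{4096 \text{ chips}}{4.096 \text{ Mchips/s}} = 1 \text{ ms},

which matches the 1 ms preamble duration shown in Figure 2.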

The message part is split into I and Q channels, which transmit the data and control parts, respectively, in parallel. The data part contains the random access request or small user packets and can support variable channel bit rates (16, 32, 64, 128 kbps). The control part carries the pilot bits and the rate information using a spreading factor of 256; the rate information indicates which spreading factor is used in the data part. Figure 3 shows the message part of the random access burst.

The slotted ALOHA structure is typically used only for transmitting data. The received data could be buffered, with estimation and detection of the bits done later. The real-time requirements for the algorithms will therefore depend on the application: applications such as video-conferencing and audio have more stringent real-time requirements than applications such as file transfer. We plan to study such applications and evaluate their real-time performance using the above algorithms.

3 Implementation Methodology

We study the performance of both versions of the maximum likelihood algorithm on the TI TMS320C6701 Evaluation Module (EVM), which consists of the floating point DSP and its associated circuitry. We also compare against GPPs for the parameter extraction method.

3.1 Algorithm Implementation

The parameters on which the algorithm depends are:
* Preamble length (L)
* Number of users (K)
* Number of paths (P)
* Signal-to-noise ratio (SNR)
* Signal-to-interference-noise ratio (SINR)
* Spreading gain (N)

Studies show that the preamble length should lie between 100 and 200. Gold Code 31 is used to assign the spreading codes to the users. The maximum number of users supported by Gold Code 31 is 33; however, the performance of the algorithm degrades for more than 15 users. The amplitudes of the signals decay exponentially with an increasing number of paths, and 3-6 paths are usually considered sufficient for capturing multipath effects. The noise levels are usually in the range of 3-10 dB (an SNR of 3 dB means the signal has twice the power of the noise). A difference of 10 dB between the strongest and weakest users' powers (SINR) is also a reasonable assumption. For details, refer to [13]. Hence we use the following parameters for our simulations: L = 150, K = 1 to 15, P = 3, SNR = 5 dB, SINR = -10 dB, N = 31. The data generation method used to produce the inputs is shown in Figure 4.

Figure 4. Method for Data Generation (the preamble, Gold codes, complex amplitudes and delays for the K users are combined to generate the observation vector, to which multipath, multiple-access interference and noise are added at the specified SNR and SINR to produce the received observation vector of size N*L)
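For reference, the simulation settings above can be collected as follows. This is an illustrative C sketch only, not the actual simulation code; the struct and field names are invented here.

    /* Simulation parameters from Section 3.1, gathered in one place. */
    struct sim_params {
        int    preamble_length;   /* L    */
        int    num_users;         /* K    */
        int    num_paths;         /* P    */
        double snr_db;            /* SNR  */
        double sinr_db;           /* SINR */
        int    spreading_gain;    /* N    */
    };

    static const struct sim_params params = {
        .preamble_length = 150,
        .num_users       = 15,    /* varied from 1 to 15 in the experiments */
        .num_paths       = 3,
        .snr_db          = 5.0,
        .sinr_db         = -10.0,
        .spreading_gain  = 31,
    };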

The main computations involved in the algorithms are matrix multiplications, matrix inversions, and matrix additions. The code was implemented in C. The algorithm used for inversion of matrices was the LU decomposition method; for details about the algorithm and C code, refer to [12]. We use a simple Picksort routine [12] for sorting and re-arranging arrays. Though it is an O(P^2) method and therefore not optimal, it is used because it is simple and P (the number of paths) is not large (three, in this case).
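A minimal sketch of such an O(P^2) straight-insertion sort, ordering the P estimated paths by decreasing amplitude magnitude while carrying the delays along, is shown below. It is illustrative only; the array layout is an assumption, not the report's actual code.

    /* Order the P paths by decreasing |amplitude|, keeping delays aligned. */
    #include <math.h>

    void sort_paths(int P, double amp[], double delay[])
    {
        for (int j = 1; j < P; j++) {
            double a = amp[j], d = delay[j];
            int i = j - 1;
            /* shift smaller-magnitude entries right to keep descending order */
            while (i >= 0 && fabs(amp[i]) < fabs(a)) {
                amp[i + 1]   = amp[i];
                delay[i + 1] = delay[i];
                i--;
            }
            amp[i + 1]   = a;
            delay[i + 1] = d;
        }
    }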

3.2 DSP Processor Model

We use the TI TMS320C6701, a 32-bit floating point DSP [17], for our simulations. The DSP has a VLIW-type architecture with a clock speed of up to 133 MHz. The core CPU consists of 32 registers (which can be paired to hold double-word data) and 8 functional units, comprising 2 multipliers and 6 ALUs. The functional units are divided into two groups of four, one in each data path (see Table 2). Each instruction must use a different functional unit, and each functional unit has its own write port into a register file. The CPU can execute up to 8 instructions per cycle using all the units. The register files are split into two files (A and B), each containing sixteen 32-bit registers. Two cross paths exist between the two groups of functional units and the two register files, allowing functional units on one data path to access the opposite side's register file. There are two 32-bit paths for loading data from memory into the register files, so the C67 can load two 32-bit registers, one in file A and one in file B, simultaneously.

Functional Unit      | FP Operations
.L unit (.L1, .L2)   | Arithmetic operations, compares, INT-FP conversions, normalization
.S unit (.S1, .S2)   | Arithmetic operations, shifts, reciprocal, square root, absolute value, SP-DP conversions, branches
.M unit (.M1, .M2)   | 16 x 16 and 32 x 32 multiplies
.D unit (.D1, .D2)   | 32-bit adds/subtracts, linear and circular address calculation, loads and stores

Table 2. Functional Units and Operations performed on the C67

The DSP has 64 KB each of internal program and data memory. The EVM adds external memory consisting of 256 KB of SRAM and 8 MB of DRAM. The CPU has a load-store architecture, which means that the only way to access data in memory is with a load or store operation. Data loads and program fetches follow the same path through the pipeline; they simply use different phases to complete their operations, and in both cases memory accesses are broken into multiple phases. This enables the C67 to access memory at high speed. The C67 uses 8-way interleaved memory banks. Since each bank is single-ported memory, only one access to each bank is allowed per cycle; two 32-bit memory operations per cycle are allowed as long as they do not access the same bank.

Table 3 gives the latencies of arithmetic operations on DSPs.

3.3 Implementation on DSP

The code was implemented on a floating point DSP (TI TMS320C6701). The code contains square roots and divisions, so a complete fixed point implementation was not feasible; we therefore implemented the algorithm on the floating point DSP. The code was initially written in generic floating point C. It was first modified to remove file I/O for the DSP. Later, we replaced functions with inline code, as we saw that the compiler did not produce efficient code and more aggressive optimizations were possible by inlining. Inlining also eliminates some of the temporary variables which consume precious memory space on the DSP. The C67 DSP has assembly instructions which calculate approximate reciprocals and reciprocal square roots, and we investigated the use of these approximate instructions. We also included assembly language subroutines [15] for critical parts of the code. The assembly subroutines are not only highly optimized code, but they also help in further reducing some temporary variables and loops. The use of assembly language subroutines may be thought of as similar to the use of the enhanced instruction sets in GPPs, where specialized instructions are used for performance enhancement. The assembly routines used were those for matrix-vector multiplication and dot product calculation; both are TI's floating point assembly benchmarks.
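For illustration, the refinement that the math library performs on top of the approximate hardware seed (discussed further in Section 4.1) looks roughly like the following C sketch. The seed argument stands in for the result of the approximate reciprocal instruction, and the iteration count is indicative only, not taken from TI's library source.

    /* Refine an ~8-bit-accurate reciprocal seed x0 ~ 1/a with Newton-Raphson
       iterations x <- x*(2 - a*x).  Two steps bring the result close to full
       single precision; the approximate hardware instruction supplies x0. */
    static float recip_refine(float a, float x0)
    {
        float x = x0 * (2.0f - a * x0);  /* first iteration  */
        x = x * (2.0f - a * x);          /* second iteration */
        return x;
    }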

TMS320C67x DSP operation                | Cycles
FP add instruction                      | 1
FP multiply instruction                 | 1
Approximate FP divide instruction       | 1
FP divide function                      | 28
Approximate FP square root instruction  | 1
FP square root function                 | 34

Table 3. Instruction Cycles for different DSP instructions


Figure 5. Impact of specialized instructions on execution time (execution time versus number of users, 0 to 15, for the base C67 code and the C67 code with approximations)

The code was compiled [19] using the TI Code Generation Tools version 3.0 for the C67 EVM so as to place the variables and stack in the external SRAM [18], using the maximum optimization level (-o3: file-level optimization), the large memory model (-ml3), and the option indicating that specific pointer-based aliasing techniques are not used (-mt). For details, refer to [19].
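For reference, a representative invocation with these options might look like the line below. The compiler shell name, source file name, output name, and linker command file are placeholders rather than the project's actual build command:

    cl6x -o3 -ml3 -mt channel_est.c -z -o channel_est.out lnk.cmd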

3.3.1 Simulations for DSP

We use the TI Code Generation Tools version 3.0 for the compiler and TI Code Composer Studio version 4.01 for profiling the code on the DSP. The execution time is calculated by setting profile points in the appropriate parts of the code in Code Composer and setting the proper clock rates. A 400 MHz Pentium II machine serves as the host to the EVM. Code Composer Studio uses the JTAG interface on the EVM for profiling the code.

4 Simulation Results and Analysis : Parameter Extraction

4.1 Impact of Specialized Instructions for DSPs

The execution time for the base floating point code versus the code using specialized instructions for square root and reciprocal is shown in Figure 5. We get around a 10% improvement in performance from the use of the approximate specialized instructions. This is because the actual library for the DSP implements these operations with the specialized instructions followed by 2 or 3 iterations of Newton-Raphson's method for convergence. As Table 3 shows, the functions for floating point reciprocals and square roots in the math library take 28 and 34 cycles [17] respectively, compared to the single-cycle approximate instructions. The approximate instructions are specified to be accurate in the exponent and to at least 8 bits in the mantissa, and we find that the errors in the output due to the use of these hardware instructions are of the order of 0.3%. Thus, eliminating the iterations for convergence and using the single-cycle approximate instructions improves performance. However, the performance benefit is only 10% because the square roots and inversions are used only in the parameter extraction part, which is not the critical code in the program.

Figure 6. Optimization effects on execution time (normalised execution time for 15 users: base code, with approximations, with assembly, and the best that can be achieved on the C67)

4.2 Optimization levels for DSPs

We also show the performance benefits obtained by moving to different levels of optimization [16] for the C67 code. This is shown in Figure 6 for 15 users. As can be seen from the figure, we achieve a 1.08X improvement by using specialized instructions only for square roots and reciprocals. The execution time reduces by 2.34X by using highly optimized assembly subroutines for matrix multiplications and dot products, which form the critical part of the code; this is because the compiler is unable to produce highly efficient VLIW code for the C67, and the assembly subroutines also help eliminate some of the loops and intermediate variables. Finally, we show that another 2.42X improvement is possible by using more assembly code in our program. Thus, we see an overall improvement of 6.1X by modifying the original code to suit the DSP.
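The overall figure is simply the product of the individual gains:

1.08 \times 2.34 \times 2.42 \approx 6.1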

4.3 Comparing with GPP

For comparing the performance of the GPP and the DSP, the primary criterion is the execution time of the channel estimation algorithm on each processor. We use the results from [6] for the GPP comparison.

Figure 7 shows the execution time of the algorithm for 15 users on the DSP with varying implementations and on GPPs for different processor models. The first four bars correspond to the DSP: base compiler-generated code (DSP1), code using approximations (DSP2), code enhanced with assembly (DSP3), and the best possible code (DSP4), all on the C67. The next four bars correspond to execution time on an UltraSPARC-2 without VIS (GPP1), an UltraSPARC-2 with VIS (GPP2), an ILP processor without VIS (GPP3), and an ILP processor with VIS (GPP4).


Figure 7. Comparing GPP and DSP (left: normalised execution time for 15 users for DSP1-DSP4 and GPP1-GPP4; right: execution time in milliseconds versus number of users for the UltraSPARC-II, the UltraSPARC-II with VIS, the C67 with assembly, and the best achievable C67 code)

From Figure 7, it can be seen that the GPPs perform better than the DSPs in terms of execution time. An appropriate comparison is DSP3 with GPP2, which shows the GPP having a 4.49X performance advantage. The differences can be attributed to the differences in architecture (4-way superscalar for the GPP vs. 8-way VLIW for the DSP), cache hierarchy (16 KB internal program and data caches plus a 1 MB external cache vs. no caches for the DSP), and processor clock speeds (250 MHz vs. 133 MHz). The figure also shows how a future base station using state-of-the-art GPPs with ILP and 500 MHz clock speeds would perform for channel estimation.

Figure 7 also shows the performance of the GPPs and DSPs for a varying number of users, which gives an idea of how the algorithm scales. As we increase the number of users, the computations increase linearly, and the curves are almost linear up to 15 users, with the execution time increasing linearly with the number of users. As the number of users increases, the gap between GPP and DSP becomes more and more pronounced. This is expected from the different slopes of the curves: the slope of each curve equals the time taken to execute the algorithm for a single user on the corresponding processor, so the greater the execution time, the steeper the curve. Another important observation is the minor deviation from linearity as the number of users approaches 15, which hints at super-linear growth as the number of users increases beyond 15. This is again expected: with a greater number of users, not only do the computations increase, but the data set also grows, so functional unit and memory stalls start affecting the execution times and prevent them from scaling linearly.
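The behaviour described above amounts to the simple model

t(K) \approx t_0 + K \, t_{\text{user}},

where t_user, the per-user execution time, sets the slope of each curve and t_0 is the user-independent overhead. This is an informal summary of the discussion, not a fitted model.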


Figure 8. Memory Requirements (data memory required in bytes for the parameter extraction method as a function of preamble length, 100 to 200, and number of users, up to 20, for the GPP and the DSP, with the 64 KB and 256 KB limits marked)

4.3.1 Memory requirements

DSPs have very stringent memory constraints, and the programmer has to decide where to allocate memory based on the size of the data. If the data does not fit in the 64 KB internal data memory of the chip, the programmer has to place it in external memory. The C67 EVM has 256 KB of external SRAM and 8 MB of external DRAM. The memory requirements for the DSP are shown in Figure 8. It can be seen that the data sizes for channel estimation do not fit in the internal 64 KB on-chip data memory and hence have to be placed in external SRAM. A comparison with the amount of memory required by a GPP is also made. Code written for GPPs tends to use more functions and temporary variables, which explains the increased static memory usage; the operating system on the GPP would also occupy a significant amount of memory (not considered here). However, since GPPs have large memories and caches at their disposal, the programmer need not worry about memory constraints.
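As a rough illustration of why the data does not fit on chip (assuming single-precision complex storage of 8 bytes per sample; the exact layout in the implementation may differ), the received observation vector alone from Figure 4 has N \cdot L entries:

N \cdot L = 31 \times 150 = 4650 \text{ samples} \approx 37 \text{ KB},

and the correlation and code matrices come on top of that, pushing the total past the 64 KB internal data memory, consistent with Figure 8.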

5 Simulation Results and Analysis : Joint Estimation and Detection

The joint estimation and detection scheme shows a significant performance improvement over the parameter extraction method, because the computations to be performed are only a subset of those in the parameter extraction method. The performance is shown in Figure 9. As we can see from the figure, the joint estimation and detection scheme gives a 2.92X speedup over the parameter extraction method for 15 users. If the detector is also included in the execution time evaluation, the joint estimation and detection scheme would perform better still, as the matrices need not be rebuilt with the extracted parameters.

6 Meeting Real-Time Requirements

In this section, we discuss the shortcomings of current DSPs and ways to improve the processor architecture and compilers for DSPs in order to meet the real-time constraints of channel estimation.


Figure 9. Performance of Joint Estimation and Detection (execution time in milliseconds versus number of users on the C67, joint method vs. parameter extraction)

6.1 Shortcomings of DSPs

The execution time on the C6x VLIW DSPs depends critically on the number of operations that can be found to execute in parallel to fill the functional unit slots. Other issues, such as memory allocation and the proper selection of the available specialized instructions, also have a significant impact on performance. The key idea of the VLIW architecture is to exploit the parallelism in the code by statically scheduling instructions in the delay slots so as to keep all the functional units busy. The parallelism is found by techniques such as software pipelining and loop unrolling. If all the slots are not fully utilized, they are filled with NOPs, which leads to a significant increase in code memory if there is not sufficient parallelism. Hence, the VelociTI VLIW architecture has the concept of fetch packets and execute packets: VelociTI allows smaller packets of parallel instructions to be contained within fetch packets, thus reducing the number of NOPs required in memory. If an execute packet crosses a fetch packet boundary, it is passed over to the next fetch packet and the current packet is filled with NOPs. Thus, the efficiency of the code depends to a very large extent on the efficiency of the compiler. Compiler support for DSPs in terms of code and power efficiency is one of the hot areas in current compiler research, and there is significant potential for improvement [9]. The common technique is to resort to hand-coded assembly and/or specialized instructions for critical operations. However, assembly programming is difficult for VLIW, as tracking and scheduling so many units becomes complicated.

The compiled assembly code does not expose much of the parallelism that is inherent in the application, due to poor compiler performance. Also, since DSP profiling is done from the PC via the JTAG interface, that time gets included in the profiling analysis and affects the accuracy of the execution time estimate. The lack of public-domain simulators like RSIM for DSP architectures also limits our ability to find the bottlenecks and perform simulations.

Also, the clock speed of DSPs is low because of a more complicated pipeline, the need to perform single-cycle MACs, and the need to conserve power. This also affects performance linearly.


6.2 Suggested Improvements in Architecture

* Internal Memory

The data for channel estimation does not fit in the internal 64 KB memory of the DSP, as we saw from Figure 8. Hence, we have to go to the external SRAM, which leads to a loss of performance and makes off-chip memory necessary. Since the concept of data caches does not apply to DSPs, due to the streaming nature of the data, the memory latencies become a bottleneck when external memory is used. Hence, more internal memory would be useful for DSPs. Another tradeoff would be to have on-chip DRAM: on-chip DRAM would be slower than SRAM, but it is much smaller per bit and consumes less power, which is of considerable importance for DSPs.

* Data Prefetching

Most of the operations are matrix oriented, so when a memory access is made it makes sense to bring the neighbouring elements along in the same access. This prefetching would reduce the memory latencies and improve performance, especially for large matrix operations.

* ASIC and FPGA support

Once the critical part of the code is identified, the DSP core should be flexible enough to include ASIC or FPGA support that can be used to offload the major computations (e.g., the Viterbi decoder in the C54x).

* Additional specialized instructions

We saw that the major computations for channel estimation are matrix operations such as multiplications, transposes, and inversions, plus square roots and reciprocals (for parameter extraction). The specialized instructions in the DSP were useful for the parameter extraction method of channel estimation. Instructions for array operations which work on the current data while automatically incrementing the memory address to prefetch the next locations would be a useful enhancement. Also, most of the operations in the channel estimation algorithm use complex arithmetic, so instructions that handle complex arithmetic could prove very useful in these applications, as the sketch below illustrates.
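A minimal C sketch (illustrative only, not DSP intrinsics): a single complex multiply-accumulate expands to four real multiplies and four real additions/subtractions on hardware without complex-arithmetic support, which is the overhead such instructions would hide in the correlation and matrix kernels.

    /* One complex multiply-accumulate: acc += a * b. */
    typedef struct { float re, im; } cplx;

    static cplx cmac(cplx acc, cplx a, cplx b)
    {
        acc.re += a.re * b.re - a.im * b.im;  /* real part:      2 mul, 1 sub, 1 add */
        acc.im += a.re * b.im + a.im * b.re;  /* imaginary part: 2 mul, 2 add        */
        return acc;
    }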

6.3 Suggested Improvements in Compilers

* Compiler Efficiency

Compilers for DSPs still lag behind those for GPPs in the efficiency of the code they produce. As a result, DSPs have to be programmed in assembly language to get optimum performance. The main issue with VLIW compilers is that it is too difficult to hand code algorithms in assembly for a VLIW machine, while the compiler is unable to generate sufficiently optimized code to keep all the units busy. The advantages of architectural advances are wasted if there is not enough compiler support to exploit them.

* OS support

Currently, memory allocation for the program has to be done by the programmer. The programmer therefore has to keep track of issues such as code size, stack size, and the memory locations where code, data, and stack need to be placed, issues which are usually handled by the operating system on a GPP. If the programmer does not allocate memory properly, the program may try to access unmapped areas of memory. This could easily be avoided if the compiler helped with memory allocation. The programmer may also allocate memory ineffectively, placing data in external memory that could have been placed in internal memory. Lightweight OS support should be made available to place the data, stack, and code in the proper memory locations without passing the onus to the programmer.

Our suggestions for OS support have been acknowledged by TI for their next version of Code Composer Studio.

7 Future Directions in Processors for Wireless Communications

We can see that both algorithms for channel estimation using the maximum likelihood method fall short of the required real-time constraints on DSPs. Though GPPs show 1.85X better performance, they too fall short of the real-time requirements. We also see the effect of code optimizations and specialized instructions on the DSPs: the code optimized using assembly language subroutines shows a 6.1X improvement over regular C code, which illustrates the lack of compiler efficiency for DSPs.

Kozyrakis and Patterson [8] provide a good introduction to future processor needs in personal mobile communications. The requirements for future base stations are given in [14]. Future base stations have several needs beyond low cost and high performance. There should be on-chip integration of large amounts of memory: base stations require substantial amounts of memory in order to handle rapid changes of data as callers enter and depart cell areas, and as we show in our memory requirements (Figure 8), the amount of memory required varies almost linearly with the number of users. Other requirements include low IC power consumption, to reduce the need for cooling fans and space and to minimize the need for backup power. Future base stations also need ICs with multiple advanced peripherals, such as enhanced buffered serial ports, to help designers handle multiple channels per DSP. Support for multiple DSPs may also be required for performance enhancement, with DSPs providing host port interfaces for communication without additional glue logic.

As both DSPs and GPPs seem incapable of meeting the W-CDMA performance requirements, chip makers are suggesting unified approaches [2] of mixed GPP-DSP-coprocessor-ASIC-FPGA solutions, which mainly consist of a DSP, a GPP, or a unified core with ASICs and FPGAs as internal or external coprocessors.

Other approaches have also been considered. In [8], it is suggested that even a futuristic billion-transistor GPP may not be able to satisfy personal mobile computing requirements. The authors suggest a merged GPP and DSP with the power budget of the latter. The key requirements for such processors are high performance, energy and power efficiency, small size, and low design complexity.

The use of a Vector IRAM [8] for future processors is also suggested. Vector IRAM is a combination of vector processing and the integration of logic and DRAM on a single chip. Since it uses DRAM as main memory, the system has sufficient bandwidth for demanding applications and offers ease of programmability similar to a vector or SIMD processor.

The future of processor technology for wireless applications looks exciting, and an interesting fusion of the above processor technologies may be seen in the near future.


8 Future Work

We would like to extend our maximum likelihood based channel estimation method to incorporate tracking, Doppler effects, and the effect of multiple sensors at the receiver. We also plan to look at subspace-based algorithms which incorporate tracking and in which the use of a preamble can be avoided. We also plan to evaluate the performance of these algorithms for the downlink case. The effect of the A/D converter at the receiver on the accuracy of the algorithms also needs to be investigated. Finally, we plan to look at real-time applications such as audio to evaluate their real-time performance using our algorithms.

Among the architecture issues, we plan to look at ways to speed up the algorithms in hardware. We also look forward to developing and using future architectures for wireless communication.

9 Acknowledgments

I would like to thank my advisor, Dr. Joseph Cavallaro, for helping me with ideas and suggestions for the project, and for patiently answering my doubts and difficulties. I would also like to thank Chaitali, Suman, Partha, Praful, Gang and Vishwas for their help in this project. Chaitali helped me a lot with understanding the details of the algorithm and provided me with Matlab code for the algorithms. Praful helped me with the simulations and provided the GPP comparison. Partha gave several suggestions on the GPP comparison issues. Gang helped me with the DSP software. I would also like to thank Dr. Behnaam Aazhang and Dr. Sarita Adve for their guidance in this project.

References

[1] Fumiyuki Adachi, Mamoru Sawahashi, and Hirohito Suda. Wideband DS-CDMA for Next-Generation Mobile Communication Systems. IEEE Communications Magazine, 36(9):56-69, September 1998.

[2] Anthony Cataldo. Chip makers sketch plans for 3G cell phones. http://www.eet.com/story/OEG19990416S0026, April 1999.

[3] Erik Dahlman, Bjorn Gudmundson, Mats Nilsson, and Johan Skold. UMTS/IMT-2000 Based on W-CDMA. IEEE Communications Magazine, 36(9):70-80, September 1998.

[4] Riaz Esmailzadeh and Maria Gustafsson. A New Slotted ALOHA Based Random Access Method for CDMA Systems. In ICUPC, October 1997.

[5] ETSI. Submission of Proposed Radio Transmission Technologies. Technical report, http://www.etsi.org/smg/UTRA/utra.pdf, 1998.

[6] Praful Kaul and Sridhar Rajagopal. General Purpose vs. DSPs for 3G Wireless Communications. Internal report, Rice University, April 1999.

[7] Zoran Kostic and Selvarajan Seetharaman. Digital Signal Processors in Cellular Radio Communication. IEEE Communications Magazine, December 1997.

[8] Christoforos E. Kozyrakis and David A. Patterson. A New Direction for Computer Architecture Research. IEEE Computer, pages 24-32, November 1998.

[9] Stephen Ohr. StarCore design challenges DSP stronghold of TI. http://www.edtn.com/story/tech/OEG19990419S0012-R, April 1999.

[10] Tero Ojanpera and Ramjee Prasad. An Overview of Air Interface Multiple Access for IMT-2000/UMTS. Technical report, Nokia Research Center, Delft University of Technology, September 1998.

[11] Andreas Polydoros and Savo Glisic. Code Division Multiple Access Communications, pages 225-266. Kluwer Academic Publishers, 1995.

[12] William Press, Brian Flannery, Saul Teukolsky, and William Vetterling. Numerical Recipes in C. Cambridge University Press, 1991 edition, 1991.

[13] Chaitali Sengupta. Algorithms and Architectures for Channel Estimation in Wireless CDMA Communication Systems. PhD thesis, Rice University, 1998.

[14] TI. Base Station OEM Requirements. http://www.ti.com/sc/docs/wireless/97/oem.htm.

[15] TI. TMS320C67x Assembly Benchmarks. http://www.ti.com/sc/docs/products/dsp/c6000/67bench.htm.

[16] TI. TMS320C62x/C67x Programmer's Guide. TI, February 1998.

[17] TI. TMS320C62x/C67x CPU and Instruction Set Reference Guide. TI, March 1998.

[18] TI. TMS320C6x Evaluation Module Reference Guide. TI, February 1998.

[19] TI. TMS320C6x Optimizing C Compiler User's Guide. TI, February 1998.
