[ieee 2009 ieee workshop on signal processing systems (sips) - tampere, finland...



Page 1: [IEEE 2009 IEEE Workshop on Signal Processing Systems (SiPS) - Tampere, Finland (2009.10.7-2009.10.9)] 2009 IEEE Workshop on Signal Processing Systems - Implementation of the W-CDMA

IMPLEMENTATION OF THE W-CDMA CELL SEARCH ON A MPSOC DESIGNED FOR SOFTWARE DEFINED RADIOS

Fabio Garzia*, Roberto Airoldi, Tapani Ahonen, Jari Nurmi

Department of Computer Systems
Tampere University of Technology
Tampere, Finland
email: [email protected]

Dragomir Milojevic

Université Libre de Bruxelles,
Bio, Electro and Mechanical Systems, CP165/56,
av. F. Roosevelt 50, B-1050 Brussels, Belgium
[email protected]

*The author gratefully acknowledges the Nokia Foundation and GETA, Research School in Electronics, Telecommunications and Automation, for their support.

ABSTRACT

This paper describes the implementation of the W-CDMA cell search algorithm on a homogeneous, general-purpose Multi-Processor System-on-Chip (MPSoC) architecture. The architecture is composed of nine nodes based on COFFEE RISC cores that communicate through a hierarchical Network-on-Chip (NoC). The work focuses on the parallelization of the cell search algorithm, enabling execution on different processing nodes and exploiting the capabilities of the network-on-chip. We achieved a total speed-up of 7.3X compared with a single-core system, taking into account the overhead of the communication between different nodes. The result is significant because it is very close to the theoretical maximum of 9X. Considering the hardware implementation, the target cell search is performed in 104 ms on an FPGA with a 75 MHz maximum frequency, and in 40 ms on an ASIC with a 200 MHz maximum frequency.

1. INTRODUCTION

In recent years Software Defined Radios (SDRs) have become increasingly popular, since they can access several wireless networks using different protocols from the same terminal. For example, a mobile phone can simultaneously access second and third generation cellular networks as well as wireless local area networks (WLAN), getting the high data throughput typical of a WLAN and the mobility provided by the cellular networks.

A typical approach for an SDR system is to use a DSP to define via software the exact protocol to be used at a given time. However, standard DSP cores do not have enough processing power for current applications. Several alternative approaches with different levels of flexibility are possible. One approach is to design a dedicated hardware block for this purpose. A multithreaded, low-power, application-specific processor for SDR applications, combining classical integer and SIMD units and embedded into a more complex SoC, has been proposed in [1]. In the context of MPSoC platforms for SDR applications, a fully programmable architecture with 4 SIMD cores, implemented in 90 nm technology, has been proposed in [2], achieving 2 Mbps W-CDMA and 24 Mbps 802.11a. More specifically, a low-power cell search in W-CDMA has been described in [3]. A heterogeneous MPSoC platform with run-time reconfigurable hardware has been presented in [4].

Another model for a programmable baseband receiver for SDR systems is described in [5]. In particular, the system was designed by coupling application-specific coprocessor accelerators to a RISC processor. The radio technologies considered in that work were Wideband Code Division Multiple Access (W-CDMA [6]) and Orthogonal Frequency Division Multiplexing (OFDM [7]).

We believe that a more flexible approach is required if a system has to support different standards. Nevertheless, flexibility is affordable only if it does not significantly decrease the overall performance. This paper follows this design choice and describes the use of a general-purpose multiprocessor system as a baseband processor for SDR. The paper is organized as follows. In the next section we describe our template for MPSoC. Then we present the MPSoC targeted to SDR applications. A test case based on the W-CDMA cell search is presented in Section 4.

2. THE MPSOC TEMPLATE

The multiprocessor system is based on the Silicon Cafe template. The template provides a configurable VHDL model to create an MPSoC based on a fixed NoC infrastructure and a variable number of nodes. Each node can host a different number of computational cores with different characteristics.

The NoC infrastructure is built around a hierarchical network of switching interconnections. A local level of hierarchy provides non-blocking connections between the computational core and its peripherals, forming a node. The global level of hierarchy enables communication between the nodes via a mesh network. High flexibility is provided throughout the network by programmable arbitration and source routing. The two levels of hierarchy are bridged together by a communication assist interface acting as both an initiator (master) and a target (slave) device of a local computation node. There are thus at least two contending initiators (masters) in a node: the processor core and the initiator side of the bridge interface (I-BIF).

978-1-4244-4335-2/09/$25.00 ©2009 IEEE, SiPS 2009

The target (slave) devices of a node include the target side of the bridge interface (T-BIF), the local data memory, and the local instruction memory. Both memories are of scratchpad type, i.e., software-managed caches without fixed hardware policies. The processor core's requests to access remote peripherals, that is, peripherals of another node, are directed by the local switches to the T-BIF. The T-BIF looks up the route to the destination address in a run-time reconfigurable source routing table. To make the lookup as fast as possible, the routes are assigned to fixed-size memory pages. Currently it is possible to configure up to 16 different routes.
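As an illustration of the page-based route lookup described above, the following minimal sketch models a T-BIF routing table in Python. Only the limit of 16 routes comes from the text. The page size (`PAGE_BITS`), the encoding of a route as a tuple of port names, and the names `RouteTable`, `set_route` and `lookup` are hypothetical and do not reflect the actual hardware interface.

```python
MAX_ROUTES = 16   # up to 16 configurable routes (from the text)
PAGE_BITS = 20    # hypothetical page size: 2**20 bytes per route page


class RouteTable:
    """Toy model of a run-time reconfigurable source routing table."""

    def __init__(self):
        self.routes = [None] * MAX_ROUTES

    def set_route(self, page, hops):
        """Assign a complete route (sequence of output ports) to one page."""
        if not 0 <= page < MAX_ROUTES:
            raise IndexError("only %d routes can be configured" % MAX_ROUTES)
        self.routes[page] = tuple(hops)

    def lookup(self, address):
        """Return (route, offset). The page index is taken directly from the
        upper address bits, so no search through the table is needed."""
        page = (address >> PAGE_BITS) & (MAX_ROUTES - 1)
        route = self.routes[page]
        if route is None:
            raise LookupError("no route configured for page %d" % page)
        return route, address & ((1 << PAGE_BITS) - 1)
```

Because the page index is a bit field of the address, the lookup cost in this model does not depend on how many routes are configured, which is one plausible reading of why fixed-size pages make the lookup fast.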

The routing information defines the complete route to the remote peripheral(s) across the global and the local levels of hierarchy. This information is attached to the packet before it is passed on to the global network for delivery to the destination node. When the packet reaches the destination node, it is absorbed by the respective I-BIF. The I-BIF executes remote read and write operations on the local node and responds to remote reads by passing the data to the T-BIF for delivery via the global network to the remote requesters.

The communication infrastructure supports multicasting and broadcasting in two ways. Switching to multiple output ports is allowed on both the global and local levels of hierarchy. Broadcasting by switching to multiple output ports of a global switch requires the remaining parts of the routes to the destinations to be similar, since they share the source routing information. Hence only regular casting patterns can be used. If a non-regular casting pattern is desired, the packet can be addressed to both the remote peripheral of interest and the T-BIF of the destination node. The destination node T-BIF then passes a copy to another node, which in turn might route a copy to a third destination, and so on. The last route on the casting chain ends without addressing the remote T-BIF. Further details of the switches are described in [8] and [9].

3. A MULTI-PROCESSOR SYSTEM-ON-CHIP FOR SDR

The template has been customized to realize the baseband processor for SDR applications. We analyzed two radio standards, OFDM and W-CDMA, and tried to realize an architecture flexible enough to support both.

First of all, we decided to use one COFFEE RISC core (see [10] and [11]) as the single computational engine in every node. The embedded integer multiplier could ease the mapping of tasks like correlations, while keeping the overall complexity low. The resulting architecture of each node is depicted in Fig. 1.

Fig. 1. Block diagram of one COFFEE processing node. [Figure not reproduced. Labeled blocks: COFFEE core, initiator and target network interfaces (bridge), request and response switches, local arbiter, data and instruction scratchpads, global switch and global arbiter.]

In addition, an SDR baseband engine should receive the data from the RF front end. It is more convenient if only one node, the one in the central position, is connected to the input ports and delivers the data to be processed to the other nodes. The same node can also act as task scheduler and control the task distribution among the processing nodes. From now on we refer to the central node as the control node, while all the other nodes are called processing nodes (see Fig. 2).

Another parameter to fix was the number of nodes. Considering the model described above, we thought that the number of nodes should not be too high, in order to limit the latency of data distribution and the communication overhead. Moreover, most wireless protocols are based on power-of-2 data streams. Hence a good choice for the number of processing nodes might be a power of 2. Due to the mesh topology, a natural choice was 9 nodes (1 control node plus 8 processing nodes). This way the data distribution requires only 1 to 2 hops and the workload is uniformly distributed among the nodes.
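The "1 to 2 hops" figure can be checked with a short sketch. It assumes a 3x3 mesh with the control node in the centre and shortest-path routing that counts one hop per mesh link. The function name `hops_from_center` is illustrative only.

```python
from collections import Counter


def hops_from_center(size=3):
    """Manhattan distance from the central node to every other mesh node."""
    c = size // 2
    return {
        (row, col): abs(row - c) + abs(col - c)
        for row in range(size)
        for col in range(size)
        if (row, col) != (c, c)
    }


hops = hops_from_center()
histogram = Counter(hops.values())   # number of processing nodes per hop count
```

Under these assumptions four processing nodes are one hop away from the control node and the four corner nodes are two hops away.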

The template provides the possibility to set the routes for the communication between nodes. In our 9-core implementation we adopted the following sets of routes.

• Point-to-point communication. We defined nine routes in the central node to enable the direct transfer of a packet towards a specific processing node. Conversely, we fixed in each processing node one route directed to the central node. This mechanism enables point-to-point communication between each processing core and the control core (see Fig. 3(a)).

• Broadcast communication. We set a route in the control node that enables a broadcast mechanism. According to the specified source routing, a packet sent through this route is forwarded in all possible directions (up, down, left, right; red arrows in Fig. 3(b)). The packet is absorbed in the destination nodes, but also sent to the local T-BIF for further transmission. In particular, the packet is redirected by the up and down nodes (Slave 1 and Slave 6) to their left and right neighbors (blue arrows in Fig. 3(b)). This way the broadcast is achieved.

Fig. 2. Block diagram of the proposed MPSoC platform.
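The two-level broadcast route can be sketched as follows. The node layout follows the labels of Fig. 3 (master in the centre, Slave 1 above it, Slave 6 below it). The function name and the hop bookkeeping are illustrative assumptions rather than part of the platform software.

```python
# Node layout of Fig. 3: the master (control node) sits in the centre.
GRID = [
    ["Slave 0", "Slave 1", "Slave 2"],
    ["Slave 3", "Master", "Slave 4"],
    ["Slave 5", "Slave 6", "Slave 7"],
]
MASTER = (1, 1)


def two_level_broadcast():
    """Return {node name: level at which the packet is absorbed}."""
    received = {}
    r, c = MASTER
    # Level 1: the control node switches the packet to all four ports.
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        received[GRID[r + dr][c + dc]] = 1
    # Level 2: the up and down nodes pass a copy to their left and right
    # neighbours through their local T-BIF.
    for row in (r - 1, r + 1):
        for col in (c - 1, c + 1):
            received[GRID[row][col]] = 2
    return received
```

In this model every processing node receives the packet exactly once: four of them directly from the control node and the four corner nodes through Slave 1 or Slave 6.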

Fig. 3. Bidirectional point-to-point channels (a) and two-level broadcast channel (b). [Figure not reproduced. Node layout: Slave 0, Slave 1, Slave 2 in the top row; Slave 3, Master, Slave 4 in the middle row; Slave 5, Slave 6, Slave 7 in the bottom row.]

3.1. Hardware Implementation

The system has been targeted both to ASIC and FPGA implementation.

Table 1 collects the post-layout figures (without I/O pads) using a 65 nm library with WCCOM (worst case commercial) corners, 0.95 V and 85 degrees Celsius. The total logic utilization is 630 Kgates, with a maximum working frequency of 200 MHz. A single processor core occupies 66 Kgates, while the local network takes only 5 Kgates.

The FPGA implementation addressed an Altera Stratix II FPGA and was performed using Quartus II version 8.0 SP1. Table 2 summarizes the synthesis results. The system occupies 72% of the logic resources and 60% of the memory resources. The maximum working frequency achieved in this case was 75 MHz.

Table 1. Synthesis results of the proposed MPSoC platform on 65 nm technology with WCCOM (worst case commercial) corners, 0.95 V and 85 degrees Celsius.

Entity                    Area
Entire platform           5.5 mm²
Memories                  4.2 mm² (288 KB SP + 144 KB DP)
Total logic               1.3 mm² (630 Kgates)
Global network            0.12 mm² (62 Kgates)
Node                      0.6 mm²
Data scratchpad           0.24 mm² (32 KB, SP)
Instruction scratchpad    0.22 mm² (16 KB, DP)
COFFEE RISC processor     0.14 mm² (66 Kgates)
Local network             0.01 mm² (5 Kgates)

Table 2. Synthesis results of the proposed MPSoC platform on an Altera EP2S180 FPGA device.

Entity                    ALUT     Logic Regs    Resource utilization (%)
Total logic               76780    50482         73%
Global network            2813     3548          3%
Node                      8237     5177          8%
COFFEE RISC processor     7862     4945          7.5%
Local network             346      232           0.3%

4. TEST CASE: CELL SEARCH IN W-CDMA

We adopted the W-CDMA radio protocol as a case study. This work focuses on the cell search algorithm.

4.1. Cell Search Implementation

Cell search in W-CDMA is the mechanism that synchronizes the mobile terminals to the downlink scrambling code transmitted by the closest base station. It can be divided into three steps:

• slot synchronization;

• frame synchronization;

• scrambling code identification.

Three different transmission channels are used to ease the cell search operation: Primary Synchronization Channel (P-SCH), Secondary Synchronization Channel (S-SCH) and Common Pilot Channel (CPICH). The three channels are shown in Fig. 4 for one frame period (10 ms, 15 slots). P-SCH always transmits the same sequence of 256 samples at the beginning of each slot for any cell of the radio network. This channel is used for the slot synchronization phase. S-SCH transmits a different sequence of 256 samples in each slot of the frame. The last channel, the CPICH, is used for the identification of the scrambling code.

Fig. 4. Frame and slots transmitted for each channel of the W-CDMA protocol. [Figure not reproduced. It shows P-SCH, S-SCH and CPICH over one frame (15 slots), with one slot marked.]

4.1.1. Slot Synchronization

During the slot synchronization the receiver uses the P-SCH sequence to synchronize with a cell. This is done using a single Finite Impulse Response (FIR) filter to search for the primary synchronization code. The slot boundary of the cell can be obtained by detecting peaks in the filter output sequence.

The filter requires a correlation between input samples and fixed coefficients (256 elements), implemented as a sum-of-products. In our multiprocessor engine, each processing node can perform a portion of the sum-of-products (32 per node), while the control core adds together all the temporary results. This operation can be performed off-line if it does not match the real-time requirements. This means additional buffering of the input samples, which requires a 5 KB dedicated memory.

After the correlation calculations, the central core performs the peak detection. If the absolute value of the last 3 correlations is higher than a fixed threshold, the system has an estimate of the location of the slot.

4.1.2. Frame Synchronization and Code Group ID

The receiver uses the samples from the S-SCH to perform the frame synchronization, identifying the code group of the cell found in the slot synchronization.

The frame synchronization can be performed by executing sixteen parallel correlations over the fifteen slots that compose a frame. The correlations are executed between the received signal and all 16 possible secondary synchronization code sequences. Once the 16 × 15 correlations are available, we can build the received codeword by searching for the maximum value among the 16 correlations of each of the 15 slots. Comparing this codeword with all 64 possible codewords, we are able to identify the code group of the transmitting cell.

In the parallel implementation, each processing core executes 2 correlations for each of the 15 slots. After that, the control core finds the index of the maximum correlation value for each slot, building the codeword. This codeword is compared with all 64 possible ones. This is done in parallel in the slave nodes, distributing 8 codewords per node. Knowing the code group, we can calculate the frame boundaries.

4.1.3. Scrambling Code Identification

During the scrambling code identification, the receiver determines the exact primary scrambling code used by the cell. The primary scrambling code is identified through symbol-by-symbol correlation over the CPICH with all codes within the code group identified in the previous step. The 512 possible scrambling codes are divided into groups of 8, one group per codeword. Thus, once we have identified the code group in the frame synchronization step, we have restricted the scrambling code identification from 512 candidates to only 8.

Considering the code group determined in the frame synchronization, 8 different scrambling codes are generated and correlated with the received signal of the next incoming frame. Once again the correlations are distributed among the processing nodes. Each node calculates a single correlation point between the incoming data stream and the candidate code generated for testing. Finally, the control node finds the maximum among the 8 correlation values returned by the processing cores.
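The work split used in the slot and frame synchronization steps can be sketched in Python as follows. The sizes (256 taps, 8 processing nodes, 32 products per node, 16 codes, 15 slots, 64 codewords, 8 codewords per node) come from the text. Several details are assumptions made only for illustration: the coefficients are conjugated in the correlation, a candidate codeword is scored by the number of slots in which it agrees with the received one, the codebook is supplied by the caller (the standardized table is not reproduced here), and cyclic shifts of the codewords are not handled. All function names are hypothetical.

```python
TAPS = 256               # samples in the primary synchronization code
NODES = 8                # processing nodes
CHUNK = TAPS // NODES    # 32 products per node
SLOTS = 15               # slots per frame
SSC_CODES = 16           # secondary synchronization code sequences
GROUPS = 64              # candidate codewords (code groups)


def partial_sum(samples, coeffs, node):
    """Work of one processing node: its 32-term slice of the sum-of-products."""
    lo = node * CHUNK
    return sum(s * c.conjugate()
               for s, c in zip(samples[lo:lo + CHUNK], coeffs[lo:lo + CHUNK]))


def correlation_point(samples, coeffs):
    """Work of the control node: add the partial results of all nodes."""
    return sum(partial_sum(samples, coeffs, n) for n in range(NODES))


def build_codeword(corr):
    """corr[slot][code] holds correlation magnitudes; keep the index of the
    strongest of the 16 codes in each of the 15 slots."""
    return [max(range(SSC_CODES), key=lambda k: corr[slot][k])
            for slot in range(SLOTS)]


def match_code_group(codeword, codebook):
    """Each node scores 8 of the 64 candidates; the control node keeps the best."""
    def score(group):
        return sum(a == b for a, b in zip(codeword, codebook[group]))

    per_node = GROUPS // NODES
    best_per_node = [max(range(n * per_node, (n + 1) * per_node), key=score)
                     for n in range(NODES)]
    return max(best_per_node, key=score)
```

Splitting the 256-term sum into eight 32-term slices gives the same correlation value as a single-node computation, so the only added cost in this model is the transfer of samples to the nodes and of partial results back to the control node.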

4.2. Cell Search Results

The implementation described above has been simulated using Mentor ModelSim. The output data have been compared with a Matlab model of the application to verify its correct behavior. The model assumes a channel PSNR equal to 10 dB.

The implementation results were compared with the performance of a single-processor architecture based on the same processor core. In particular, we considered the number of clock cycles needed to perform the different parts of the algorithm, assuming that the processor core has the same maximum frequency in the single-processor and the multiprocessor approach. The results are collected in Table 3.

Table 3. Performance comparison between single processor architecture and proposed MPSoC platform

Application                 Uniproc. (clock cycles)    Multiproc. (clock cycles)    Speed-up
One Corr. Point             12381                      2546                         5X
Slot Synch. (Fixed Part)    52890764                   7147387                      7.5X
Frame Synch.                3750593                    471458                       8X
Scramb. Code                149973                     56203                        2.7X
Cell Search                 57410380                   7802348                      7.3X

The speed-up for the execution of the whole application is 7.3X. In this calculation we suppose that the first peak is found after 50 correlation calculations. The result is very significant, since it approaches the theoretical maximum (9X). This means that the implementation on the multiprocessor system benefits from the code parallelization thanks to hardware mechanisms that reduce the communication overhead. In particular, the broadcasting mechanism provided by the network infrastructure gives the biggest contribution to the reduction of the transfer overhead.

Some other considerations can be made if we analyze the application step by step. The first step (the slot synchronization) is characterized by an indefinite number of correlations followed by a fixed part of code. The number of correlations depends on when the first peak is detected. For benchmarking purposes, we consider the speed-up in the calculation of a single correlation point and the speed-up of the fixed part (the correlation average, which does not vary with the position of the slot). The single-processor architecture executes a single correlation (256 accumulations of complex products) in 12381 clock cycles, while the multiprocessor system requires only 2546 clock cycles. This is equivalent to a 5X speed-up. This loss in performance is mainly due to the overhead required for the transfer of the data stream to all the slave cores. The fixed part, which does not require any additional transfer, gives a 7.5X speed-up. The speed-up in the frame synchronization is 8X. Such a high value is due to the fact that all the transfers in this step use the broadcast mechanism, thus minimizing the communication overhead. The scrambling code identification gives the worst speed-up (2.7X). In this case the slave cores use different coefficients that are known only after the second step. Therefore the transfer of these coefficients cannot benefit from the broadcast mechanism. Since we need to perform only eight correlations, the communication overhead of these transfers affects the final performance.

Considering the FPGA implementation, the entire cell search requires 104 ms. This value is significantly lower than that of some previous implementations of the cell search (see [12], 300 ms). The largest fraction of time is spent in the slot synchronization (95 ms). Since it cannot be done in real time (one slot is 0.67 ms wide), the buffering of 5 slots is required (25 KB altogether). After that, the receiver should keep the synchronization with the incoming samples, for example by buffering only one slot and overwriting the values when the buffer is full. This does not affect the feasibility of the application using an FPGA implementation. However, according to [13], "symbols from more than one time slot may be combined non-coherently" only under the following condition. If T1, T2, and T3 are the times required by the three stages, then because of clock drift (T1 + T2 + T3) should not be much larger than 50 time slots, that is, 33.5 ms. Our result is much larger than this limit, so the FPGA implementation would not be viable if we consider the clock drift issue. On the other hand, the frame synchronization is performed in 6.2 ms, and since it is faster than the acquisition of a frame of samples, it does not require additional buffering. The same considerations apply to the scrambling code identification, which requires 0.75 ms. We are assuming here that the three steps are executed serially. With a single core, the cell search on FPGA would require 765 ms.

Considering the ASIC implementation, the application can be performed in 40 ms on a system running at 200 MHz. This means 60 time slots, and it would work fine even considering the clock drift problem.

Generally speaking, our results are worse than some dedicated implementations provided in other papers. In particular, pipelined implementations are much faster (see [14], 30 ms, and [15], 10 ms). Nevertheless, our platform is intended for general-purpose use and is characterized by very high flexibility. Flexibility is very important when implementing multi-standard systems like software defined radios.

5. CONCLUSION

This paper describes the implementation of the W-CDMA target cell search algorithm on a homogeneous, general-purpose MPSoC platform designed for Software Defined Radio applications. The proposed multiprocessor system is built using nine nodes, each node using a single COFFEE RISC processor core. The interconnection infrastructure is a hierarchical NoC that allows local communication between the core and the memories and global communication between different nodes.

We mapped the target cell search algorithm on this system. We evaluated the speed-up by comparing our results with a single-processor system using the same processing core. The overhead due to the communication between different nodes has been limited thanks to the broadcast transmission mechanism implemented in the proposed communication infrastructure. In the considered application, 2 out of 3 steps require the transmission of the same data stream to the different processing nodes. Using broadcast transmission we were able to achieve a speed-up of 7.3X, very close to the theoretical maximum of 9X.

The system has been synthesized for FPGA and ASIC. FPGA results show the execution of the cell search in 104 ms, while ASIC results give an execution time of 40 ms.
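The overall figures quoted in this paper can be recomputed from the per-step cycle counts of Table 3, the assumption of 50 correlation points before the first peak, and the two clock frequencies. The short Python sketch below does this. The variable and function names are illustrative, and the step breakdown (50 correlation points plus the three fixed parts, executed serially) is inferred from the text rather than taken from the authors' tooling.

```python
# Clock cycles from Table 3: (single processor, multiprocessor).
ONE_CORR_POINT = (12381, 2546)
SLOT_SYNC_FIXED = (52890764, 7147387)
FRAME_SYNC = (3750593, 471458)
SCRAMBLING_CODE = (149973, 56203)
CORR_POINTS_BEFORE_PEAK = 50      # assumption stated in the text


def total_cycles(column):
    """Cycles for the whole cell search (column 0: single, 1: multi)."""
    return (CORR_POINTS_BEFORE_PEAK * ONE_CORR_POINT[column]
            + SLOT_SYNC_FIXED[column]
            + FRAME_SYNC[column]
            + SCRAMBLING_CODE[column])


def time_ms(cycles, freq_hz):
    """Execution time in milliseconds at a given clock frequency."""
    return 1e3 * cycles / freq_hz


single, multi = total_cycles(0), total_cycles(1)
speed_up = single / multi                  # about 7.36
fpga_ms = time_ms(multi, 75e6)             # about 104 ms
asic_ms = time_ms(multi, 200e6)            # about 39 ms
fpga_single_ms = time_ms(single, 75e6)     # about 765 ms
```

The totals match the Cell Search row of Table 3 exactly (57410380 and 7802348 cycles), and the derived times agree with the quoted 104 ms, 40 ms and 765 ms to within rounding.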

6. REFERENCES

[1] Suman Mamidi, Emily R. Blem, Michael J. Schulte, John Glossner, Daniel Iancu, Andrei Iancu, Mayan Moudgill, and Sanjay Jinturkar, "Instruction set extensions for software defined radio on a multithreaded processor," in CASES '05: Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems, New York, NY, USA, 2005, pp. 266-273, ACM.

[2] Yuan Lin, Hyunseok Lee, Yoav Harel, Mark Woh, Scott Mahlke, Trevor Mudge, and Krisztian Flautner, "A System Solution for High-Performance, Low Power SDR," in Proceeding of the SDR 05 Technical Conference and Product Exposition, 2005.

[3] Chi-Fang Li, Yuan-Sun Chu, Jan-Shin Ho, and Wern-Ho Sheen, "Cell search in WCDMA under large-frequency and clock errors: Algorithms to hardware implementation," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 55, no. 2, pp. 659-671, March 2008.

[4] Gerard K. Rauwerda, Paul M. Heysters, and Gerard J. M. Smit, "Towards Software Defined Radios using Coarse-grained Reconfigurable Hardware," IEEE Trans. Very Large Scale Integr. Syst., vol. 16, no. 1, pp. 3-13, 2008.

[5] Lasse Harju, Programmable Receiver Architectures for Multimode Mobile Terminals, Ph.D. thesis, Tampere University of Technology (TUT), Department of Information Technology, Institute of Digital and Computer Systems (IDCS), Tampere, Finland, August 2006, 160 pages, TUT Publication 604, ISSN: 1459-2045, ISBN: 952-15-1618-6.

[6] E. Dahlman, P. Berning, J. Knutsson, F. Ovesjo, M. Persson, and C. Roobol, "WCDMA-the radio interface for future mobile multimedia communications," IEEE Transactions on Vehicular Technology, vol. 47, no. 4, pp. 1105-1118, November 1998.

[7] Xiaodong Wang, "OFDM and its application to 4G," in Proc. International Conference on Wireless and Optical Communications 14th Annual WOCC 2005, 22-23 April 2005, p. 69.

[8] Tapani Ahonen, Designing network-based single-chip system architectures, Ph.D. thesis, Tampere University of Technology (TUT), Department of Information Technology, Institute of Digital and Computer Systems (IDCS), Tampere, Finland, October 2006, 242 pages, TUT Publication 625, ISSN: 1459-2045, ISBN: 952-15-1666-6.

[9] Tapani Ahonen and Jari Nurmi, "Synthesizable switching logic for network-on-chip designs on 90nm technologies," in Proceedings of the 2006 International Conference on IP Based SoC Design (IP-SOC '06), 6-7 December 2006, pp. 299-304, Design and Reuse S.A.

[10] Juha Kylliainen, Tapani Ahonen, and Jari Nurmi, "General-purpose embedded processor cores - the COFFEE RISC example," in Processor Design: System-on-Chip Computing for ASICs and FPGAs, Jari Nurmi, Ed., chapter 5, pp. 83-100. Kluwer Academic Publishers / Springer Publishers, June 2007, ISBN-10: 1402055293, ISBN-13: 978-1-4020-5529-4.

[11] "COFFEE Core Project Website," consulted 20 January 2009, URL: http://coffee.tut.fi.

[12] K. Higuchi, Y. Hanada, M. Sawahashi, and F. Adachi, "Experimental evaluation of 3-step cell search method in W-CDMA mobile radio," in Vehicular Technology Conference Proceedings, 2000. VTC 2000-Spring Tokyo. 2000 IEEE 51st, 2000, vol. 1, pp. 303-307.

[13] S. Sriram and S. Hosur, "Fast acquisition method for DS-CDMA systems employing asynchronous base stations," Communications, 1999. ICC '99. 1999 IEEE International Conference on, vol. 3, pp. 1928-1932, 1999.

[14] Y.-P.E. Wang and T. Ottosson, "Cell Search in W-CDMA," vol. 18, no. 8, pp. 1470-1482, Aug. 2000.

[15] Sang yun Hwang, Bub ju Kang, and Jae seok Kim, "Performance analysis of initial cell search using time tracker for W-CDMA," in Global Telecommunications Conference, 2001. GLOBECOM '01. IEEE, 2001, vol. 5, pp. 3055-3059.