[ieee 2009 ieee workshop on signal processing systems (sips) - tampere, finland...
IMPLEMENTATION OF THE W-CDMA CELL SEARCH ON A MPSOC DESIGNED FOR SOFTWARE DEFINED RADIOS
Fabio Garzia*, Roberto Airoldi, Tapani Ahonen, Jari Nurmi
Department of Computer Systems, Tampere University of Technology
Tampere, Finland
email: [email protected]
ABSTRACT
This paper describes the implementation of the W-CDMA cell search algorithm on a homogeneous, general-purpose Multi-Processor System-on-Chip (MPSoC) architecture. The architecture is composed of nine nodes based on COFFEE RISC cores that communicate through a hierarchical Network-on-Chip. The work focuses on the parallelization of the cell search algorithm, enabling execution on different processing nodes and exploiting the capabilities of the network-on-chip. We achieved a total speed-up of 7.3X compared with a single-core system, taking into account the overhead of the communication between different nodes. The result is significant, since it is very close to the theoretical maximum of 9X. Considering the hardware implementation, the target cell search is performed in 104 ms on an FPGA with a 75 MHz maximum frequency, and in 40 ms on an ASIC with a 200 MHz maximum frequency.
1. INTRODUCTION
In recent years, Software Defined Radios (SDRs) have become increasingly popular, since they allow a single terminal to access several wireless networks using different protocols. For example, a mobile phone can simultaneously access second and third generation cellular networks as well as wireless local area networks (WLANs), obtaining both the high data throughput typical of a WLAN and the mobility provided by cellular networks.
A typical approach for an SDR system is to use a DSP to define via software the exact protocol to be used at a given time. However, standard DSP cores do not have enough processing power for current applications. Several alternative approaches with different levels of flexibility are possible. One approach is to design a dedicated hardware block for this purpose. A multithreaded, low-power, application-specific processor for SDR applications, combining classical integer and SIMD units and embedded into a more complex SoC, has been proposed in [1]. In the context of MPSoC platforms for SDR applications, a fully programmable architecture with 4 SIMD cores, implemented in 90 nm technology, has been proposed in [2], achieving 2 Mbps W-CDMA and 24 Mbps 802.11a. More specifically, a low-power cell search in W-CDMA has been described in [3]. A heterogeneous MPSoC platform with run-time reconfigurable hardware has been presented in [4].
*The author gratefully acknowledges the Nokia Foundation and GETA (Research School in Electronics, Telecommunications and Automation) for their support.
Dragomir Milojevic
Université Libre de Bruxelles, Bio, Electro and Mechanical Systems, CP165/56,
av. F. Roosevelt 50, B-1050 Brussels, Belgium
Another model for a programmable baseband receiver for SDR systems is described in [5]. In particular, the system was designed by coupling application-specific coprocessor accelerators to a RISC processor. The radio technologies considered in that work were Wideband Code Division Multiple Access (W-CDMA [6]) and Orthogonal Frequency Division Multiplexing (OFDM [7]).
We believe that a more flexible approach is required if a system has to support different standards. Nevertheless, flexibility is affordable only if it does not significantly decrease the overall performance. This paper follows this design choice and describes the use of a general-purpose multiprocessor system as a baseband processor for SDR. The paper is organized as follows. In the next section we describe our MPSoC template. Section 3 presents the MPSoC targeted to SDR applications. A test case based on the W-CDMA cell search is presented in Section 4.
2. THE MPSOC TEMPLATE
The multiprocessor system is based on the Silicon Cafe template. The template provides a configurable VHDL model to create an MPSoC based on a fixed NoC infrastructure and a variable number of nodes. Each node can host a different number of computational cores with different characteristics.
The NoC infrastructure is built around a hierarchical network of switching interconnections. A local level of hierarchy provides non-blocking connections between the computational core and its peripherals, forming a node. The global level of hierarchy enables communication between the nodes via a mesh network. High flexibility is provided throughout the network by programmable arbitration and source routing. The two levels of hierarchy are bridged together by a communication assist interface acting as both an initiator (master) and a target (slave) device of a local computation node. There are thus at least two contending initiators (masters) in a node: the processor core and the initiator side of the bridge interface (I-BIF).
978-1-4244-4335-2/09/$25.00 ©2009 IEEE, SiPS 2009
The target (slave) devices of a node include the target side of the bridge interface (T-BIF), the local data memory, and the local instruction memory. Both memories are of scratchpad type, i.e., software-managed caches without fixed hardware policies. The processor core's requests to access remote peripherals, that is, peripherals of another node, are directed by the local switches to the T-BIF. The T-BIF looks up the route to the destination address in a run-time reconfigurable source routing table. To make the lookup as fast as possible, the routes are assigned to fixed-size memory pages. Currently it is possible to configure up to 16 different routes.
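The page-based route lookup can be illustrated with a minimal sketch. Only the limit of 16 routes comes from the text; the page size, the address layout, the hop names and the class itself are hypothetical and are chosen only to show why a fixed page size makes the lookup a simple shift and index.

```python
from typing import Dict, Tuple

MAX_ROUTES = 16   # from the text: up to 16 routes can be configured
PAGE_BITS = 16    # assumption: 64 KiB pages (the real page size is not given)


class SourceRoutingTable:
    """Hypothetical model of a run-time reconfigurable source routing table."""

    def __init__(self) -> None:
        self._routes: Dict[int, Tuple[str, ...]] = {}

    def configure(self, page: int, route: Tuple[str, ...]) -> None:
        """Bind one fixed-size memory page to a complete source route."""
        if not 0 <= page < MAX_ROUTES:
            raise ValueError("only %d routes can be configured" % MAX_ROUTES)
        self._routes[page] = tuple(route)

    def lookup(self, address: int) -> Tuple[Tuple[str, ...], int]:
        """Return (route, offset inside the page) for a remote address."""
        page = address >> PAGE_BITS                 # page index selects the route
        offset = address & ((1 << PAGE_BITS) - 1)   # remaining bits address the target
        if page not in self._routes:
            raise KeyError("no route configured for page %d" % page)
        return self._routes[page], offset
```

Because every page has the same size, selecting the route needs no comparison against address ranges, which is the property the text relies on for a fast lookup.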
The routing information defines the complete route to the remote peripheral(s) across the global and the local levels of hierarchy. This information is attached to the packet before it is passed on to the global network for delivery to the destination node. When the packet reaches the destination node, it is absorbed by the respective I-BIF. The I-BIF executes remote read and write operations on the local node and responds to remote reads by passing the data to the T-BIF for delivery via the global network to the remote requesters.
The communication infrastructure supports multicasting and broadcasting in two ways. Switching to multiple output ports is allowed on both the global and the local levels of hierarchy. Broadcasting by switching to multiple output ports of a global switch requires the remaining parts of the routes to the destinations to be similar, since they share the source routing information. Hence only regular casting patterns can be used. If a non-regular casting pattern is desired, the packet can be addressed to both the remote peripheral of interest and the T-BIF of the destination node. The destination node T-BIF then passes a copy to another node, which in turn might route a copy to a third destination, and so on. The last route on the casting chain ends without addressing the remote T-BIF. Further details of the switches are described in [8] and [9].
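The chained casting used for non-regular patterns can be sketched as follows. This is an illustrative model only: the node names and the `forward_to` mapping are hypothetical stand-ins for the routes stored in each node's T-BIF.

```python
from typing import Dict, List, Optional


def cast_chain(first_node: str, forward_to: Dict[str, str]) -> List[str]:
    """Return the nodes that receive a copy, in delivery order.

    forward_to[node] names the node to which that node's T-BIF passes a copy.
    A node without an entry ends the chain, mirroring a last route that does
    not address the remote T-BIF.
    """
    delivered: List[str] = []
    node: Optional[str] = first_node
    while node is not None:
        if node in delivered:
            raise ValueError("casting chain loops back to " + node)
        delivered.append(node)          # the packet is absorbed by this node
        node = forward_to.get(node)     # and optionally passed on by its T-BIF
    return delivered
```

For example, `cast_chain("N2", {"N2": "N5", "N5": "N7"})` delivers the packet to N2, N5 and N7 in that order.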
3. A MULTI-PROCESSOR SYSTEM-ON-CHIP FOR SDR
The template has been customized to realize a baseband processor for SDR applications. To this end we analyzed two radio standards, OFDM and W-CDMA, and aimed at an architecture flexible enough to support both.
First of all, we decided to use one COFFEE RISC core (see [10] and [11]) as the single computational engine in every node.
[Figure 1 diagram: a COFFEE cluster containing the COFFEE core, initiator and target network interfaces (NI), request and response switches with a local arbiter, data and instruction scratchpads, and a bridge to the global switch and global arbiter.]
Fig. 1. Block diagram of one COFFEE processing node.
The embedded integer multiplier eases the mapping of tasks such as correlations, while keeping the overall complexity low. The resulting architecture of each node is depicted in Figure 1.
In addition, an SDR baseband engine should receive the data from the RF front-end. It is more convenient if only one node, the one in the central position, is connected to the input ports and delivers the data to be processed to the other nodes. The same node can also act as task scheduler and control the distribution of tasks among the processing nodes. From now on we refer to the central node as the control node, while all the other nodes are called processing nodes (see Fig. 2).
Another parameter to fix was the number of nodes. Considering the model described above, we thought that the number of nodes should not be too high, in order to limit the latency of data distribution and the communication overhead. Moreover, most wireless protocols are based on power-of-2 data streams. Hence a good choice for the number of processing nodes is a power of 2. Given the mesh topology, a natural choice was 9 nodes (1 control node plus 8 processing nodes). This way the data distribution requires only 1 to 2 hops and the workload is uniformly distributed among the nodes.
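The hop counts quoted above can be checked with a short sketch. It assumes that the control node sits at the center of a 3 x 3 mesh and that packets follow shortest (Manhattan) paths; the function name is illustrative.

```python
from collections import Counter
from typing import Dict


def hop_histogram(size: int = 3) -> Dict[int, int]:
    """Count processing nodes by their hop distance from the central node."""
    center = (size // 2, size // 2)
    hops = Counter()
    for row in range(size):
        for col in range(size):
            if (row, col) == center:
                continue  # the control node itself is not a destination
            # Manhattan distance = number of mesh links on a shortest path
            hops[abs(row - center[0]) + abs(col - center[1])] += 1
    return dict(hops)
```

Under these assumptions, four processing nodes are one hop away from the control node and the four corner nodes are two hops away.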
Fig. 2. Block diagram of the proposed MPSoC platform.
The template provides the possibility to set the routes for the communication between nodes. In our 9-core implementation we adopted the following sets of routes.
• Point-to-point communication. We defined nine routes in the central node to enable the direct transfer of a packet towards a specific processing node. On the other side, we fixed in each processing node one route directed to the central node. This mechanism enables point-to-point communication between each processing core and the control core (see Fig. 3(a)).
• Broadcast communication. We set a route in the control node that enables a broadcast mechanism. According to the specified source routing, a packet sent through this route is sent in all possible directions (up, down, left, right; red arrows in Fig. 3(b)). The packet is absorbed in the destination nodes, but also sent to the local T-BIF for further transmission. In particular, the packet is redirected by the up and down nodes (Slave 1 and 6) to their left and right nodes (blue arrows in Fig. 3(b)). This way the broadcast is achieved.
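The two-level broadcast can be traced with a small simulation. The node layout below (Slave 0 to Slave 7 around the Master) is inferred from Fig. 3 and from the statement that the up and down nodes are Slave 1 and Slave 6; the function and variable names are illustrative.

```python
from typing import List, Optional, Tuple

# Assumed layout of the 3 x 3 mesh, following the labels of Fig. 3.
GRID = [["S0", "S1", "S2"],
        ["S3", "M", "S4"],
        ["S5", "S6", "S7"]]
DIRECTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def neighbour(pos: Tuple[int, int], direction: str) -> Optional[Tuple[int, int]]:
    """Return the mesh position reached from pos, or None at the mesh border."""
    row, col = pos[0] + DIRECTIONS[direction][0], pos[1] + DIRECTIONS[direction][1]
    return (row, col) if 0 <= row < 3 and 0 <= col < 3 else None


def two_level_broadcast() -> List[Tuple[int, str]]:
    """Return (level, node) pairs for every copy delivered by the broadcast."""
    master = (1, 1)
    deliveries: List[Tuple[int, str]] = []
    forwarders: List[Tuple[int, int]] = []
    for direction in DIRECTIONS:              # level 1: master to its 4 neighbours
        pos = neighbour(master, direction)
        deliveries.append((1, GRID[pos[0]][pos[1]]))
        if direction in ("up", "down"):       # only these nodes forward further
            forwarders.append(pos)
    for pos in forwarders:                    # level 2: forward to left and right
        for direction in ("left", "right"):
            nxt = neighbour(pos, direction)
            deliveries.append((2, GRID[nxt[0]][nxt[1]]))
    return deliveries
```

In this model every processing node receives exactly one copy: four in the first level and the four corner nodes in the second.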
Table 1. Synthesis results of the proposed MPSoC platform on 65 nm technology with WCCOM (worst case commercial) corners, 0.95 V and 85 degrees Celsius.
Entity                     Area
Entire platform            5.5 mm²
Memories                   4.2 mm² (288KB SP + 144KB DP)
Total logic                1.3 mm² (630 Kgates)
Global network             0.12 mm² (62 Kgates)
Node                       0.6 mm²
Data scratch pad           0.24 mm² (32KB, SP)
Instruction scratch pad    0.22 mm² (16KB, DP)
Coffee RISC processor      0.14 mm² (66 Kgates)
Local network              0.01 mm² (5 Kgates)
Table 2. Synthesis results of the proposed MPSoC platform on Altera EP2S180 FPGA device
Entity                   ALUT     Logic Regs   Resource Utilization
Total logic              76780    50482        73%
Global network           2813     3548         3%
Node                     8237     5177         8%
Coffee RISC processor    7862     4945         7.5%
Local network            346      232          0.3%
[Figure 3 diagram: two panels, (a) and (b), each showing the 3 x 3 mesh with the Master in the center surrounded by Slave 0 to Slave 7.]
Fig. 3. Bidirectional point-to-point channels (a), and two-level broadcast channel (b)
3.1. Hardware Implementation
The system has been targeted to both ASIC and FPGA implementation.
Table 1 collects the post-layout figures (without I/O pads) using a 65 nm library with WCCOM (worst case commercial) corners, 0.95 V and 85 degrees Celsius. The total logic utilization is equal to 630 Kgates, with a maximum working frequency of 200 MHz. A single processor core occupies 66 Kgates, while the local network is only 5 Kgates.
The FPGA implementation targeted an Altera Stratix II FPGA and was performed using Quartus II version 8.0 SP1. Table 2 summarizes the synthesis results. The system occupies 72% of the logic resources and 60% of the memory resources. The maximum working frequency achieved in this case was 75 MHz.
4. TEST CASE: CELL SEARCH IN W-CDMA
We adopted the W-CDMA radio protocol as a case study. This work focuses on the cell search algorithm.
4.1. Cell Search Implementation
Cell search in W-CDMA is the mechanism that synchronizes the mobile terminal to the downlink scrambling code transmitted by the closest base station. It can be divided into three steps:
• slot synchronization;
• frame synchronization;
• scrambling code identification.
Three different transmission channels are used to ease the cell search operation: the Primary Synchronization Channel (P-SCH), the Secondary Synchronization Channel (S-SCH) and the Common Pilot Channel (CPICH). The three channels are shown in Fig. 4 for one frame period (10 ms, 15 slots). The P-SCH always transmits the same sequence of 256 samples at the beginning of each slot, for any cell of the radio network. This channel is used for the slot synchronization phase. The S-SCH transmits a different sequence of 256 samples in each slot of the frame. The last channel, the CPICH, is used for the identification of the scrambling code.
[Figure 4 diagram: one frame (15 slots) of the P-SCH, S-SCH and CPICH channels, with one slot marked.]
Fig. 4. Frame and slots transmitted for each channel of the W-CDMA protocol
4.1.1. Slot Synchronization
During the slot synchronization the receiver uses the P-SCH sequence to synchronize with a cell. This is done using a single Finite Impulse Response (FIR) filter to search for the primary synchronization code. The slot boundary of the cell can be obtained by detecting peaks in the filter output sequence.
The filter requires a correlation between the input samples and fixed coefficients (256 elements), implemented as a sum-of-products. In our multiprocessor engine, each processing node performs a portion of the sum-of-products (32 products per node), while the control core adds together all the temporary results. This operation can be performed off-line if it does not meet the real-time requirements. This implies additional buffering of the input samples, which requires a 5 KB dedicated memory.
After the correlation calculations, the central core performs the peak detection. If the absolute value of the last 3 correlations is higher than a fixed threshold, the system has an estimate of the location of the slot.
4.1.2. Frame Synchronization and Code Group ID
The receiver uses the samples from the S-SCH to perform the frame synchronization, identifying the code group of the cell found in the slot synchronization.
The frame synchronization can be performed by executing sixteen parallel correlations over the fifteen slots that compose a frame. The correlations are executed between the received signal and all 16 possible secondary synchronization code sequences. Once the 16 x 15 correlations are available, we build the received codeword by searching for the maximum value among the 16 correlations of each of the 15 slots. Comparing the obtained codeword with all 64 possible codewords, we are able to identify the code group of the transmitting cell.
In the parallel implementation, each processing core executes 2 correlations for each of the 15 slots. After that, the control core finds the index of the maximum correlation value for each slot, building the codeword. This codeword is compared with all 64 possible ones. This is done in parallel in the slave nodes, distributing 8 codewords per node. Knowing the code group, we can calculate the frame boundaries.
4.1.3. Scrambling Code Identification
During the scrambling code identification, the receiver determines the exact primary scrambling code used by the cell. The primary scrambling code is identified through symbol-by-symbol correlation over the CPICH with all codes within the code group identified in the previous step. There are 512 possible scrambling codes, divided into groups of 8, one group for each codeword. Thus, once we have identified the code group in the frame synchronization step, we have restricted the scrambling code identification from 512 candidates to only 8.
Considering the code group determined in the frame synchronization, 8 different scrambling codes are generated and correlated with the received signal of the next incoming frame. Once again the correlations are distributed among the processing nodes. Each node calculates a single correlation point between the incoming data stream and one of the generated candidate codes. Finally, the control node finds the maximum among the 8 correlation values returned by the processing cores.
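The three steps of Section 4.1 can be summarized in a minimal sequential sketch. It is an illustration under stated assumptions and not the code running on the COFFEE cores: the nodes are modeled as loop iterations, the peak test is one possible reading of the threshold rule, the codeword comparison ignores cyclic shifts, and `toy_codeword_table` is a hypothetical stand-in for the real table of 64 codewords.

```python
from typing import List, Sequence

NODES = 8  # processing nodes; the control node combines their results


def partial_sums(samples: Sequence[complex], coeffs: Sequence[complex]) -> List[complex]:
    """Slot sync: each node computes 256 / 8 = 32 terms of the sum-of-products."""
    chunk = len(coeffs) // NODES
    return [sum(s * c.conjugate()
                for s, c in zip(samples[i:i + chunk], coeffs[i:i + chunk]))
            for i in range(0, len(coeffs), chunk)]


def correlation_point(samples: Sequence[complex], coeffs: Sequence[complex]) -> complex:
    """Control node: add the temporary results returned by the nodes."""
    return sum(partial_sums(samples, coeffs))


def peak_detected(history: Sequence[complex], threshold: float) -> bool:
    """Assumed rule: the last 3 correlation magnitudes all exceed the threshold."""
    return len(history) >= 3 and all(abs(v) > threshold for v in history[-3:])


def build_codeword(corr: Sequence[Sequence[complex]]) -> List[int]:
    """Frame sync: corr[slot][code]; keep the strongest of 16 codes per slot."""
    return [max(range(len(row)), key=lambda k: abs(row[k])) for row in corr]


def agreement(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of slots in which two codewords carry the same code index."""
    return sum(x == y for x, y in zip(a, b))


def match_code_group(codeword: Sequence[int], table: Sequence[Sequence[int]]) -> int:
    """Each node scores 64 / 8 = 8 codewords; the control node keeps the best."""
    per_node = len(table) // NODES
    best = []
    for node in range(NODES):
        ids = range(node * per_node, (node + 1) * per_node)
        winner = max(ids, key=lambda g: agreement(codeword, table[g]))
        best.append((agreement(codeword, table[winner]), winner))
    return max(best)[1]


def identify_scrambling_code(correlations: Sequence[complex]) -> int:
    """Scrambling code: index of the largest of the 8 correlation values."""
    return max(range(len(correlations)), key=lambda k: abs(correlations[k]))


def toy_codeword_table() -> List[List[int]]:
    """Hypothetical table of 64 distinct 15-slot codewords (not the real one)."""
    return [[(g + slot * (1 + g // 16)) % 16 for slot in range(15)]
            for g in range(64)]
```

Splitting the 256-term correlation into eight independent partial sums is what allows the data stream to be broadcast once and processed in parallel, leaving only eight additions to the control node.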
4.2. Cell Search Results
The implementation described above has been simulated using Mentor ModelSim. The output data have been compared with a Matlab model of the application to verify its correct behavior. The model assumes a channel PSNR equal to 10 dB.
The implementation results were compared with the performance of a single-processor architecture based on the same processor core. In particular, we considered the number of clock cycles needed to perform the different parts of the algorithm, assuming that the processor core has the same maximum frequency in the single-processor and in the multiprocessor approach. The results are collected in Table 3.
Table 3. Performance comparison between single processor architecture and proposed MPSoC platform
Application                 Uniproc.          Multiproc.        Speed-up
                            (clock cycles)    (clock cycles)
One Corr. Point             12381             2546              5X
Slot Synch. (Fixed Part)    52890764          7147387           7.5X
Frame Synch.                3750593           471458            8X
Scramb. Code                149973            56203             2.7X
Cell Search                 57410380          7802348           7.3X
The speed-up for the execution of the whole application is 7.3X. In this calculation we assume that the first peak is found after 50 correlation calculations. The result is very significant, since it approaches the theoretical maximum (9X). This means that the implementation on the multiprocessor system benefits from the code parallelization thanks to hardware mechanisms that reduce the communication overhead. In particular, the broadcasting mechanism provided by the network infrastructure gives the biggest contribution to the reduction of the transfer overhead.
Further considerations can be made if we analyze the application step by step. The first step (the slot synchronization) is characterized by an indefinite number of correlations followed by a fixed part of code. The number of correlations depends on when the first peak is detected. For benchmarking purposes, we consider the speed-up in the calculation of a single correlation point and the speed-up of the fixed part (the correlation average, which does not vary with the position of the slot). The single-processor architecture executes a single correlation (256 accumulations of complex products) in 12381 clock cycles, while the multiprocessor system requires only 2546 clock cycles. This is equivalent to a 5X speed-up. The loss in performance with respect to the ideal case is mainly due to the overhead required to transfer the data stream to all the slave cores. The fixed part, which does not require any additional transfer, gives a 7.5X speed-up. The speed-up in the frame synchronization is 8X. Such a high value is due to the fact that all the transfers in this step use the broadcast mechanism, thus minimizing the communication overhead. The scrambling code identification gives the worst speed-up (2.7X). In this case the slave cores use different coefficients that are known only after the second step. Therefore the transfer of these coefficients cannot benefit from the broadcast mechanism. Since we need to perform only eight correlations, the communication overhead related to these transfers affects the final performance.
Considering the FPGA implementation, the entire cell search requires 104 ms. This value is significantly lower than some previous implementations of the cell search (see [12], 300 ms). The largest fraction of time is spent in the slot synchronization (95 ms). Since it cannot be done in real time (one slot is 0.67 ms wide), the buffering of 5 slots is required (25 KB altogether). After that, the receiver should keep the synchronization with the incoming samples, for example by buffering only one slot and overwriting the values when the buffer is full. This does not affect the feasibility of the application on an FPGA implementation. However, according to [13], "symbols from more than one time slot may be combined non-coherently" only under the following condition. If T1, T2, and T3 are the times required by the three stages, then because of clock drift (T1 + T2 + T3) should not be much larger than 50 time slots, that is, 33.5 ms. Our result is much larger than this limit, so the FPGA implementation would not be feasible if we consider the clock drift issue. On the other hand, the frame synchronization is performed in 6.2 ms, and since it is faster than the acquisition of a frame of samples, it does not require additional buffering. The same considerations apply to the scrambling code identification, which requires 0.75 ms. We are assuming here that the three steps are executed serially. On a single core, the cell search on FPGA would require 765 ms.
Considering the ASIC implementation, the application can be performed in 40 ms on a system running at 200 MHz. This corresponds to 60 time slots, and it would work well even considering the clock drift problem.
Generally speaking, our results are worse than some dedicated implementations reported in other papers. In particular, pipelined implementations are much faster (see [14], 30 ms, and [15], 10 ms). Nevertheless, our platform is intended for general-purpose use and is characterized by very high flexibility. Flexibility is very important when implementing multi-standard systems like software defined radios.
5. CONCLUSION
This paper describes the implementation of the W-CDMA target cell search algorithm on a homogeneous, general-purpose MPSoC platform designed for Software Defined Radio applications. The proposed multiprocessor system is built using nine nodes, each node using a single COFFEE RISC processor core. The interconnection infrastructure is a hierarchical NoC that allows local communication between the core and the memories and global communication between different nodes.
We mapped the target cell search algorithm on this system. We evaluated the speed-up by comparing our results with a single-processor system using the same processing core. The overhead due to the communication between different nodes has been limited thanks to the broadcast transmission mechanism implemented with the proposed communication infrastructure. In the considered application, 2 out of 3 steps require the transmission of the same data stream to the different processing nodes. Using broadcast transmission we were able to achieve a speed-up of 7.3X, very close to the theoretical maximum of 9X.
The system has been synthesized for FPGA and ASIC. FPGA results show the execution of the cell search in 104 ms, while ASIC results give an execution time of 40 ms.
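The overall figures can be reproduced from the cycle counts of Table 3 with a short calculation. The sketch assumes, as stated in Section 4.2, that the first peak is found after 50 correlation points and that the steps run serially; the dictionary keys and function names are illustrative.

```python
# Clock cycles from Table 3: (single processor, multiprocessor)
CYCLES = {
    "corr_point": (12381, 2546),
    "slot_fixed": (52890764, 7147387),
    "frame_sync": (3750593, 471458),
    "scrambling": (149973, 56203),
}
CORRELATIONS_BEFORE_PEAK = 50   # assumption stated in Section 4.2
UNI, MULTI = 0, 1


def total_cycles(column: int) -> int:
    """Cycles for the whole cell search, with the three steps run serially."""
    return (CORRELATIONS_BEFORE_PEAK * CYCLES["corr_point"][column]
            + CYCLES["slot_fixed"][column]
            + CYCLES["frame_sync"][column]
            + CYCLES["scrambling"][column])


def milliseconds(cycles: int, clock_hz: float) -> float:
    """Execution time in milliseconds at the given clock frequency."""
    return 1e3 * cycles / clock_hz


uni_total = total_cycles(UNI)        # 57410380 cycles
multi_total = total_cycles(MULTI)    # 7802348 cycles
speed_up = uni_total / multi_total   # about 7.36, reported as 7.3X
fpga_ms = milliseconds(multi_total, 75e6)        # about 104 ms
asic_ms = milliseconds(multi_total, 200e6)       # about 39 ms
fpga_single_ms = milliseconds(uni_total, 75e6)   # about 765 ms
```

The totals match the Cell Search row of Table 3, and the times agree with the 104 ms and 765 ms quoted for the FPGA; at 200 MHz the same cycle count gives about 39 ms, which the paper reports as 40 ms.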
6. REFERENCES
[1] Suman Mamidi, Emily R. Blem, Michael J. Schulte, John Glossner, Daniel Iancu, Andrei Iancu, Mayan Moudgill, and Sanjay Jinturkar, "Instruction set extensions for software defined radio on a multithreaded processor," in CASES '05: Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems, New York, NY, USA, 2005, pp. 266-273, ACM.
[2] Yuan Lin, Hyunseok Lee, Yoav Harel, Mark Woh, Scott Mahlke, Trevor Mudge, and Krisztian Flautner, "A System Solution for High-Performance, Low Power SDR," in Proceeding of the SDR 05 Technical Conference and Product Exposition, 2005.
[3] Chi-Fang Li, Yuan-Sun Chu, Jan-Shin Ho, and Wern-Ho Sheen, "Cell search in wcdma under large-frequency and clock errors: Algorithms to hardware implementation," Circuits and Systems I: Regular Papers, IEEE Transactions on [Circuits and Systems I: Fundamental Theory and Applications, IEEE Transactions on], vol. 55, no. 2, pp. 659-671, March 2008.
[4] Gerard K. Rauwerda, Paul M. Heysters, and Gerard J. M. Smit, "Towards Software Defined Radios using Coarse-grained Reconfigurable Hardware," IEEE Trans. Very Large Scale Integr. Syst., vol. 16, no. 1, pp. 3-13, 2008.
[5] Lasse Harju, Programmable Receiver Architectures for Multimode Mobile Terminals, Ph.D. thesis, Tampere University of Technology (TUT), Department of Information Technology, Institute of Digital and Computer Systems (IDCS), Tampere, Finland, August 2006, 160 pages, TUT Publication 604, ISSN: 1459-2045, ISBN: 952-15-1618-6.
[6] E. Dahlman, P. Berning, J. Knutsson, F. Ovesjo, M. Persson, and C. Roobol, "WCDMA-the radio interface for future mobile multimedia communications," Vehicular Technology, IEEE Transactions on, vol. 47, no. 4, pp. 1105-1118, November 1998.
[7] Xiaodong Wang, "OFDM and its application to 4G," in Proc. International Conference on Wireless and Optical Communications 14th Annual WOCC 2005, 22-23 April 2005, p. 69.
[8] Tapani Ahonen, Designing network-based single-chip system architectures, Ph.D. thesis, Tampere University of Technology (TUT), Department of Information Technology, Institute of Digital and Computer Systems (IDCS), Tampere, Finland, October 2006, 242 pages, TUT Publication 625, ISSN: 1459-2045, ISBN: 952-15-1666-6.
[9] Tapani Ahonen and Jari Nurmi, "Synthesizable switching logic for network-on-chip designs on 90nm technologies," in Proceedings of the 2006 International Conference on IP Based SoC Design (IP-SOC '06), 6-7 December 2006, pp. 299-304, Design and Reuse S.A.
[10] Juha Kylliainen, Tapani Ahonen, and Jari Nurmi, "General-purpose embedded processor cores - the COFFEE RISC example," in Processor Design: System-on-Chip Computing for ASICs and FPGAs, Jari Nurmi, Ed., chapter 5, pp. 83-100. Kluwer Academic Publishers / Springer Publishers, June 2007, ISBN-10: 1402055293, ISBN-13: 978-1-4020-5529-4.
[11] "COFFEE Core Project Website," consulted 20 January 2009, URL: http://coffee.tut.fi.
[12] K. Higuchi, Y. Hanada, M. Sawahashi, and F. Adachi, "Experimental evaluation of 3-step cell search method in w-cdma mobile radio," in Vehicular Technology Conference Proceedings, 2000. VTC 2000-Spring Tokyo. 2000 IEEE 51st, 2000, vol. 1, pp. 303-307.
[13] S. Sriram and S. Hosur, "Fast acquisition method for ds-cdma systems employing asynchronous base stations," Communications, 1999. ICC '99. 1999 IEEE International Conference on, vol. 3, pp. 1928-1932, 1999.
[14] Y.-P.E. Wang and T. Ottosson, "Cell Search in WCDMA," vol. 18, no. 8, pp. 1470-1482, Aug. 2000.
[15] Sang yun Hwang, Bub ju Kang, and Jae seok Kim, "Performance analysis of initial cell search using time tracker for w-cdma," in Global Telecommunications Conference, 2001. GLOBECOM '01. IEEE, 2001, vol. 5, pp. 3055-3059.