ieee transactions on circuits and systems… · implementation for multiple-antenna systems ......

14
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 56, NO. 3, MARCH 2009 685 Probabilistic Spherical Detection and VLSI Implementation for Multiple-Antenna Systems Chester Sungchung Park, Member, IEEE, Keshab K. Parhi, Fellow, IEEE, and Sin-Chong Park, Member, IEEE Abstract—This paper presents a novel probabilistic spherical- detection (P-SD) method which applies the probabilistic-search algorithm to conventional depth-first SD (DF-SD). By confining the tree search into candidates which can be selected in an adaptive manner, a large number of promising candidates can be evaluated before termination. Consequently, the proposed P-SD improves the error performance of DF-SD with early termination, while retaining the hardware efficiency. An efficient VLSI architecture is proposed for implementation of the P-SD algorithm, and the results of the synthesized architecture are presented. The main advantage of P-SD is that it can fully exploit the state-of-the-art architectures of DF-SD, since it can be implemented by simply adding two functional blocks to conventional DF-SD. By analyzing the performance-complexity tradeoffs, it is concluded that our proposed P-SD is advantageous over conventional DF-SD and -best algorithm, when the maximum-likelihood error perfor- mance is desired. Index Terms—Adaptive signal processing, CMOS digital integrated circuits, maximum-likelihood (ML) detection, mul- tiple-input–multiple-output (MIMO) systems. I. INTRODUCTION N OWADAYS, multiple-input–multiple-output (MIMO) communications have emerged as one of the key ele- ments for the next-generation wireless communications, since the channel capacity is proportional to the minimum number of antennas [1]. Practically, the use of multiple antennas can significantly improve the coverage by diversity gain and/or increase the transmission rate by multiplexing gain [2]. Current standardization activities in wireless local access networks and wireless cellular networks have chosen the use of multiple antennas to further improve the throughput of current wireless networks [3], [4]. In most of practical MIMO systems, forward-error-correc- tion (FEC) codes such as convolutional codes and low-density Manuscript received October 21, 2006; revised October 30, 2007. First pub- lished August 04, 2008; current version published March 11, 2009. This work was supported in part by the System Integration Technology Institute (SITI), Korea. This paper was presented in part at the ISCAS, Kos, Greece, May 2006 and at the ICASSP, Toulouse, France, May 2006. This paper was recommended by Associate Editor A.-Y. Wu. C. S. Park was with the Samsung Advanced Institute of Technology, Yongin 446-712, Korea. He is now with Ericsson Research, Research Triangle Park, NC 27709 USA (e-mail: [email protected]). K. K. Parhi is with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: parhi@umn. edu). S.-C. Park is with the School of Engineering, Information and Communica- tions University, Daejeon 350-713, Korea (e-mail: [email protected]). Digital Object Identifier 10.1109/TCSI.2008.2002544 parity-check (LDPC) codes [3] are used for reliable communi- cations. Accordingly, in order to achieve near-channel-capacity performance, the maximum a posteriori (MAP) processing needs to be performed at the receiver side [5]. Recently, the im- plementation of MAP processing has been facilitated by using the so-called spherical detection (SD), an efficient tree-search algorithm that finds a set of vector symbols (referred to as a list in this paper) closest to the received vector symbol [6], [7](this tree-search algorithm is often called list SD (LSD), a modified version of standard SD [8]; in this paper, the terms SD and LSD are used interchangeably, unless otherwise mentioned). The utilization of SD enables accurate calculation of soft-de- cision values for the subsequent FEC decoding, approaching maximum-likelihood (ML) error performance. Recalling that SD is categorized as a sequential-decoding al- gorithm, several tree-search strategies introduced in [9] and [10] can be considered for efficient implementation. A depth-first (DF) search strategy has been of great interest to several re- search groups, since it is attractive in terms of both error per- formance and hardware complexity [11]–[13]. The disadvan- tage of DF search strategy is that the computation amount (or, equivalently, the throughput of hardware for SD) is random and unpredictable. For this reason, a breadth-first search strategy utilized by -best algorithm [14]–[17] and its variation [18] has gained much attention recently as an alternative, since it al- ways guarantees a constant throughput. However, the compu- tational complexity of breadth-first search strategy is prohib- itively large because of unavoidable needs for sorting. Thus, the DF search strategy is still attractive from the perspective of hardware implementation, and the DF-SD with a constant throughput [19]–[21] is of significant interest. In order to guarantee a constant throughput, the tree search of DF-SD needs to be terminated earlier, i.e., before evaluating all possible candidates for vector symbols (not always but with a high probability) [11]–[13]. Thus, the error performance of DF-SD with early termination is far from that of MAP pro- cessing, since the optimality of the resulting list (in terms of dis- tance) is not guaranteed. Although there are a number of reports about standard DF-SD in the literature, relatively less attention has been paid to DF-SD with early termination [19]–[21], which motivated this paper. In this paper, in order to improve the error performance, the probabilistic-search algorithm in [22] is applied to the state-of-the-art DF-SD in [11]–[13]: By confining the search into channel-adaptively selected promising candidates, a larger number of promising candidates can be evaluated before ter- mination, which makes the resulting list closer to the optimum one. When applied to LDPC-coded MIMO orthogonal-fre- 1549-8328/$25.00 © 2009 IEEE Authorized licensed use limited to: Ericsson Research. Downloaded on March 11, 2009 at 10:45 from IEEE Xplore. Restrictions apply.

Upload: trankhuong

Post on 01-Sep-2018

222 views

Category:

Documents


0 download

TRANSCRIPT

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 56, NO. 3, MARCH 2009 685

Probabilistic Spherical Detection and VLSIImplementation for Multiple-Antenna Systems

Chester Sungchung Park, Member, IEEE, Keshab K. Parhi, Fellow, IEEE, and Sin-Chong Park, Member, IEEE

Abstract—This paper presents a novel probabilistic spherical-detection (P-SD) method which applies the probabilistic-searchalgorithm to conventional depth-first SD (DF-SD). By confiningthe tree search into candidates which can be selected in an adaptivemanner, a large number of promising candidates can be evaluatedbefore termination. Consequently, the proposed P-SD improvesthe error performance of DF-SD with early termination, whileretaining the hardware efficiency. An efficient VLSI architectureis proposed for implementation of the P-SD algorithm, and theresults of the synthesized architecture are presented. The mainadvantage of P-SD is that it can fully exploit the state-of-the-artarchitectures of DF-SD, since it can be implemented by simplyadding two functional blocks to conventional DF-SD. By analyzingthe performance-complexity tradeoffs, it is concluded that ourproposed P-SD is advantageous over conventional DF-SD and

-best algorithm, when the maximum-likelihood error perfor-mance is desired.

Index Terms—Adaptive signal processing, CMOS digitalintegrated circuits, maximum-likelihood (ML) detection, mul-tiple-input–multiple-output (MIMO) systems.

I. INTRODUCTION

N OWADAYS, multiple-input–multiple-output (MIMO)communications have emerged as one of the key ele-

ments for the next-generation wireless communications, sincethe channel capacity is proportional to the minimum numberof antennas [1]. Practically, the use of multiple antennas cansignificantly improve the coverage by diversity gain and/orincrease the transmission rate by multiplexing gain [2]. Currentstandardization activities in wireless local access networks andwireless cellular networks have chosen the use of multipleantennas to further improve the throughput of current wirelessnetworks [3], [4].

In most of practical MIMO systems, forward-error-correc-tion (FEC) codes such as convolutional codes and low-density

Manuscript received October 21, 2006; revised October 30, 2007. First pub-lished August 04, 2008; current version published March 11, 2009. This workwas supported in part by the System Integration Technology Institute (SITI),Korea. This paper was presented in part at the ISCAS, Kos, Greece, May 2006and at the ICASSP, Toulouse, France, May 2006. This paper was recommendedby Associate Editor A.-Y. Wu.

C. S. Park was with the Samsung Advanced Institute of Technology, Yongin446-712, Korea. He is now with Ericsson Research, Research Triangle Park, NC27709 USA (e-mail: [email protected]).

K. K. Parhi is with the Department of Electrical and Computer Engineering,University of Minnesota, Minneapolis, MN 55455 USA (e-mail: [email protected]).

S.-C. Park is with the School of Engineering, Information and Communica-tions University, Daejeon 350-713, Korea (e-mail: [email protected]).

Digital Object Identifier 10.1109/TCSI.2008.2002544

parity-check (LDPC) codes [3] are used for reliable communi-cations. Accordingly, in order to achieve near-channel-capacityperformance, the maximum a posteriori (MAP) processingneeds to be performed at the receiver side [5]. Recently, the im-plementation of MAP processing has been facilitated by usingthe so-called spherical detection (SD), an efficient tree-searchalgorithm that finds a set of vector symbols (referred to as a listin this paper) closest to the received vector symbol [6], [7](thistree-search algorithm is often called list SD (LSD), a modifiedversion of standard SD [8]; in this paper, the terms SD andLSD are used interchangeably, unless otherwise mentioned).The utilization of SD enables accurate calculation of soft-de-cision values for the subsequent FEC decoding, approachingmaximum-likelihood (ML) error performance.

Recalling that SD is categorized as a sequential-decoding al-gorithm, several tree-search strategies introduced in [9] and [10]can be considered for efficient implementation. A depth-first(DF) search strategy has been of great interest to several re-search groups, since it is attractive in terms of both error per-formance and hardware complexity [11]–[13]. The disadvan-tage of DF search strategy is that the computation amount (or,equivalently, the throughput of hardware for SD) is random andunpredictable. For this reason, a breadth-first search strategyutilized by -best algorithm [14]–[17] and its variation [18]has gained much attention recently as an alternative, since it al-ways guarantees a constant throughput. However, the compu-tational complexity of breadth-first search strategy is prohib-itively large because of unavoidable needs for sorting. Thus,the DF search strategy is still attractive from the perspectiveof hardware implementation, and the DF-SD with a constantthroughput [19]–[21] is of significant interest.

In order to guarantee a constant throughput, the tree searchof DF-SD needs to be terminated earlier, i.e., before evaluatingall possible candidates for vector symbols (not always but witha high probability) [11]–[13]. Thus, the error performance ofDF-SD with early termination is far from that of MAP pro-cessing, since the optimality of the resulting list (in terms of dis-tance) is not guaranteed. Although there are a number of reportsabout standard DF-SD in the literature, relatively less attentionhas been paid to DF-SD with early termination [19]–[21], whichmotivated this paper.

In this paper, in order to improve the error performance,the probabilistic-search algorithm in [22] is applied to thestate-of-the-art DF-SD in [11]–[13]: By confining the searchinto channel-adaptively selected promising candidates, a largernumber of promising candidates can be evaluated before ter-mination, which makes the resulting list closer to the optimumone. When applied to LDPC-coded MIMO orthogonal-fre-

1549-8328/$25.00 © 2009 IEEE

Authorized licensed use limited to: Ericsson Research. Downloaded on March 11, 2009 at 10:45 from IEEE Xplore. Restrictions apply.

686 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 56, NO. 3, MARCH 2009

quency-division-multiplexing (OFDM) systems, the proposedprobabilistic SD (P-SD) significantly improves the error perfor-mance of DF-SD, while maintaining the hardware efficiency.The VLSI architecture of P-SD is also proposed along withits synthesis results, showing that conventional DF-SD canbe modified by simply adding two functional blocks. Further-more, in order to compare with conventional SD, the tradeoffsbetween error performance and computational complexityare addressed. It is shown that the proposed P-SD provides asignificant gain in either required SNR or computation amount.

The remainder of this paper is organized as follows. InSection II, the principle of SD is explained, focusing on theapplication to MAP processing. The conventional DF-SD andprobabilistic-search algorithm are introduced in Section IIIand Section IV, respectively. Section V provides the solutionto improving the error performance of DF-SD with earlytermination, the so-called P-SD, which is verified by our sim-ulation results in Section VI. The VLSI architecture for P-SDis proposed, along with the synthesis results, in Section VII.Additionally, the tradeoffs between error performance andcomputational complexity are presented and compared withthose of conventional SD, DF-SD, and -best algorithm inSection VIII. Finally, Section IX concludes this paper.

II. PRINCIPLES OF SD

In a MIMO system with transmit antennas andreceive antennas, the received vector symbol,

, is given by

(1)

where is thechannel-coefficient matrix, is thetransmitted vector symbol whose elements are drawn fromthe constellation of -ary quadrature amplitude modulation(QAM), and is the additive whiteGaussian noise. Here, is obtained by

(2)

where represents the bit-to-symbol mapping of -aryQAM [3] and denotes the th-coded (and interleaved) bittransmitted at the th transmit antenna. According to the deriva-tion described in [6], the soft-decision values for the subsequentdecoding are given as

(3)

which is the Max-log approximation of a posteriori log-likeli-hood ratio (LLR) [13], [17] (note that the iterative demodula-tion and decoding method introduced in [6] is not consideredthroughout this paper because of difficulty in implementation).In the right-hand side of (3), the former maximization is per-formed over all candidates in , whilethe latter maximization is over the other candidatesin . By utilizing the SD shown in [6], [13],

[15], the soft-decision values in (3) can be calculated approx-imately yet efficiently: The maximization of squared distance

is performed over only the candidates ina list denoted by rather than over all possible candidates, inother words

(4)

As mentioned earlier, the list is composed of candidates min-imizing the (squared) distance. Hence, it is easily understoodthat as the number of candidates in the list increases, thesoft-decision values in (4) approach the Max-log-approximatedLLR in (3).

According to the QR decomposition, it follows that

(5)

where is uni-tary (i.e., ) and

is upper triangular. If is equal to , canbe expressed as

(6)

where is given by . Owingto the upper triangularity of , the calculation of can beperformed recursively: Defining a partial vector symbol in theth level as

(7)

the corresponding partial squared distance in the th-levelis recursively calculated as

(8)where the (squared) distance increment is defined as

(9)

Note that is set to zero, and thus, isnothing but from (8) and (9). Considering the sphericalconstraint with a radius

(10)

then it immediately follows that the partial vector symbols thatviolate the constraint

(11)

also violate the spherical constraint (10). This is the ideabehind the tree pruning of SD, which significantly reduces thecomputation amount [6], [7]. Even though the discussion in thissection assumes single-carrier transmission over frequency-flatfading channels, the extension to multicarrier transmission(i.e., OFDM) over frequency-selective fading channels is quite

Authorized licensed use limited to: Ericsson Research. Downloaded on March 11, 2009 at 10:45 from IEEE Xplore. Restrictions apply.

PARK et al.: P-SD AND VLSI IMPLEMENTATION FOR MULTIPLE-ANTENNA SYSTEMS 687

Fig. 1. Example of tree traversal of DF-SD (� � �, � � �). (a) Searchtree. (b) Storage elements. For clarity, the tree pruning and radius reduction arenot shown.

straightforward. Thus, the focus of this paper is given to SDitself, apart from any specific applications.

III. DF-SD

The DF-SD considered in this paper is described in this sec-tion, taking into account the VLSI architecture as well as thealgorithm. The VLSI architectures that enable high-speed op-eration in [11] and [12] (e.g., one-node-per-cycle architecture)are fully utilized. Furthermore, several useful algorithms suchas Schnorr–Euchner (SE) ordering [23] and radius reduction arealso considered.

The tree traversal assumed throughout this paper is shown inFig. 1, assuming three transmit antennas and receive antennas

and QPSK . In Fig. 1(a), the numbersinside the circles represent the alphabets of QPSK constellation.Each node in the th level represents a partial vector symbol

, and child nodes corresponding to partial vector sym-bols in the th level are derived (as connected throughbranches in the figure) from a parent node corresponding to apartial vector symbol in the th level . For example,four child nodes corresponding to , , , and

in the first level (bottom level) are derived from a parentnode corresponding to in the second level. A shaded nodein the th level has the smallest among child nodes inthe th level. Accordingly, the vector symbol indicated by threeshaded nodes is nothing but the well-known nullingand cancellation solution. For the purpose of clear illustration,the tree pruning and radius reduction are not shown in Fig. 1.

In order to fully reuse the intermediate calculations, allchild nodes from a parent node are assumed to be evaluatedsimultaneously, as in [11] and [12], throughout this paper. Tobe specific, defining as

(12)

then, it readily follows from (9) that

(13)

It is easily understood that the first term in the right-hand sideof (13) is common to all child nodes, while the other

two terms are unique to a specific child node (due to their de-pendence on the th symbol . As a result, it is desirable to cal-culate , and thus, for all child nodes simulta-neously, from the perspective of computational complexity [11]and [12]. The numbers in the parentheses shown in Fig. 1 indi-cate the instant (in clock cycle) when the corresponding node isvisited as a parent node. Since the one-node-per-cycle operationin [12] and [13] is assumed throughout this paper, the requirednumber of clock cycles is always equal to the number of visitedparent nodes.

The tree-search strategy of DF-SD combined with the SE or-dering is performed as shown in Fig. 1(a). The utilization ofstorage elements corresponding to the tree traversal is shown inFig. 1(b). Suppose that a parent node in the th level is vis-ited, and its child nodes in the th level are evaluated.If the child nodes are not in the first level (bottom level), i.e.,

, and there exists at least one child node that satisfies thespherical constraint (11), i.e., it has a value of smallerthan , then the forward recursion is performed. In the forwardrecursion of DF-SD, the closest child node, i.e., the child nodecorresponding to the partial vector symbol minimizing ,is visited as the next parent node, and the intermediate calcula-tions of the remaining child nodes (also re-ferred to as the yet-to-be-visited nodes) are stored until they arevisited as parent nodes. On the other hand, if the child nodes arein the first level or all the child nodes violate the spher-ical constraint (11), then the backward recursion is performed.In the backward recursion of DF-SD, a preferred child node inthe upper level (that is the closest to the current level) is revisitedas the next parent node. Throughout this paper, a preferred childnode in each level is defined as a node whose corresponding par-tial distance is the smallest among the yet-to-be-visited nodes inthe level and, of course, satisfies the sphere constraint. Note that,once the tree traversal is started, the original DF-SD (i.e., DF-SDwithout early termination) continues the search until there re-main no more yet-to-be-visited nodes, satisfying the sphere con-straint or, equivalently, no more preferred nodes. The earlierdescription of tree traversal of DF-SD is summarized in the con-trol flow shown in Fig. 2. This is how the conventional DF-SDkeeps the required number of storage elements (in the numberof nodes) as low as .

It is widely understood that the number of visited parentnodes of DF-SD can be significantly reduced by the help ofradius reduction. In this paper, the radius reduction of standardDF-SD shown in [12] is generalized to DF-SDwith arbitrary . Note that the only difference from stan-dard DF-SD is that the largest distance in a list is chosen as anew radius each time the list is updated. Consequently, as thenumber of candidates in a list increases, the radius reductionbecomes slower, resulting in a larger number of clock cycles(and, thus, higher computational complexity).

IV. PROBABILISTIC-SEARCH ALGORITHM

As pointed out in [11] and [12], the number of clock cyclesneeds to be constrained to a predetermined value for a constantthroughput, since, otherwise, it tends to randomly fluctuate withmultipath fading and additive noise. In other words, the tree tra-versal needs to be terminated once parent nodes are visited.

Authorized licensed use limited to: Ericsson Research. Downloaded on March 11, 2009 at 10:45 from IEEE Xplore. Restrictions apply.

688 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 56, NO. 3, MARCH 2009

Fig. 2. DF search strategy combined with the SE ordering.� represents a setof yet-to-be-visited nodes in the �th level.

Consequently, the DF-SD with early termination cannot guar-antee the optimality of the resulting list (in terms of distances).More specifically, closer candidates (i.e., candidates with dis-tances smaller than the radius) may remain to be evaluated afterearly termination of tree traversal, thus leading to degraded errorperformance [11]. In the previous example shown in Fig. 1, theevaluation of all possible candidates takes 21 clock cycles. Ifthe number of visited parent nodes is constrained to be 11 (i.e.,

), only 32 candidates are evaluated and the 32 remainingcandidates are not considered. In general, the error performanceis degraded more severely in DF-SD, as becomes smaller.

Accordingly, the main focus of this paper is the way ofimproving the error performance of DF-SD with early termi-nation. It is easily understood that as a larger number of closercandidates can be evaluated before termination, the subsequentcalculation of soft-decision values (based on the resulting list)becomes more reliable, possibly leading to improved errorperformance.

In this paper, the probability of transmission, which is de-fined as the probability that a particular candidate is indeedtransmitted, is used to distinguish more promising candidatesand evaluate them earlier. By evaluating the candidates in a de-scending order of this probability, the error performance can beimproved, approaching that of the MAP processing. The ideabehind this probabilistic-search algorithm proposed in [22] is asfollows. It can be intuitively said that a candidate with a higherprobability of transmission is more promising in that the dis-tance is, with a high probability, smaller. Therefore, if the can-didates with a higher probability of transmission are evaluatedearlier, the resulting list becomes more similar to the optimumone. Furthermore, since a candidate with a higher probability islikely to be closer to the received vector symbol, the radius ofsphere tends to be reduced more rapidly, so that a larger numberof promising candidates can be evaluated before termination.

In [22], it is shown that the probability of transmission of eachcandidate is obtained as a closed-form expression. To be spe-

Fig. 3. Evaluation of probabilities of transmission (� � �, � � �) for� � ���, � � ���, and � � ���. The probability decreases exponen-tially with ���� given at the bottom.

cific, the probability of transmission of decreases exponen-tially with

(14)

where is defined as the distance of from the closest nodeamong all the child nodes derived from a parent node, i.e.,shaded nodes shown in Fig. 1. Since all child nodes areevaluated within a single clock cycle, we are more interestedin the probability of transmission of a partial vector symbol inthe second level, i.e., the probability that the transmitted vectorsymbol is derived from a node in the second level . From(14), it follows that this probability decreases exponentially with

(15)

where denotes a vector symbol ofdistance in the second level. Note that the definition ofis based on the dependence of on . The most inter-esting part of the closed-form expression is that the probabilityof transmission can be easily calculated for all possible can-didates, without evaluating their partial distance, if only theclosest nodes as well as the channel coefficients are known tothe receiver. Fig. 3 shows as a measure of probability forthe previous example, assuming that , , and

. Recall that is either 1 or 1.414 in the normalizedQPSK constellation.

Let us revisit the tree traversal of DF-SD shown in Fig. 1 fromthe viewpoint of probability of transmission. If the tree traversalof DF-SD is constrained to 11 clock cycles, many promisingcandidates such as and remain to be evaluated aftertermination. It is important to consider the well-known statis-tical property [24] that

(16)

Since is smaller in the upper level, the candidates de-viated from the closest child node in the upper level are morelikely to have high probabilities of transmission than those de-viated in the lower level. For example, in Fig. 3, (deviatedin the third level) has a higher probability of transmission than

(deviated in the second level). Thus, it can be concludedthat the tree-search strategy of DF-SD, which first searches thecandidates deviated in the lower level (for better utilization ofstorage elements, as mentioned previously), is far from optimumin the case of early termination.

Authorized licensed use limited to: Ericsson Research. Downloaded on March 11, 2009 at 10:45 from IEEE Xplore. Restrictions apply.

PARK et al.: P-SD AND VLSI IMPLEMENTATION FOR MULTIPLE-ANTENNA SYSTEMS 689

Fig. 4. Example of tree traversal of probabilistic-search algorithm (� � �,� � �). (a) Search tree. (b) Storage elements. For clarity, the tree pruning andradius reduction are not shown.

Suppose that the probabilistic-search algorithm in [22] isdirectly applied to the state-of-the-art DF-SD in [11] and [12].Then the tree-search strategy is based on the probability oftransmission, not the DF search strategy. Fig. 4 shows the treetraversal of probabilistic-search algorithm for the previousexample, along with the corresponding utilization of storageelements. If the one-node-per-cycle operation is guaranteed, theevaluation of all possible candidates takes 21 clock cycles, justas in the DF-SD in Fig. 1. In this probabilistic-search algorithm,however, the candidates are evaluated in quite a different order,in more detail, which is in a descending order of probability oftransmission. As a result, a larger number of promising candi-dates (i.e., candidates with a higher probability of transmission)can be evaluated before termination. For example, as shown inFig. 4(a), all the candidates with smaller than or equal totwo (in the order of , , , , , and

) can be evaluated within 11 clock cycles. As shown inFig. 4(b), however, the required number of storage elements iseven larger than that of conventional DF-SD (i.e., ,as mentioned previously), since the tree traversal is solely basedon the probability of transmission (rather than efficient memoryutilization). As an example, note that, in Fig. 4(a), a partialvector symbol needs to be stored for the first ten clockcycles in this probabilistic-search algorithm, while only for thefirst four clock cycles in DF-SD. Moreover, since the requirednumber of storage elements fluctuates randomly with multipathfading and additive noise, the probabilistic-search algorithm,more accurately, a direct application of probabilistic-searchalgorithm to DF-SD, is difficult to implement efficiently, inspite of significantly improved error performance.

V. PROPOSED P-SD

In this section, the probabilistic-search algorithm shownin the previous section is combined with the state-of-the-artDF-SD, taking into consideration efficient hardware imple-mentation. The proposed SD, the so-called P-SD, takes fulladvantages of both the conventional DF-SD (e.g., efficientmemory utilization) and the probabilistic-search algorithm(e.g., improved error performance). The proposed P-SD is quitesimilar to the conventional DF-SD, and the only difference

Fig. 5. Definition of subsets for� -ary QAM. The constellation is divided intoa few subsets, ���������, which consist of all the alphabets with an identicaldistance from a reference alphabet ����.

from the conventional DF-SD is that only a few promisingcandidates are evaluated while the other candidates are prunedin the tree traversal. By confining the tree traversal to a prede-termined number of promising candidate groups and followingthe DF search strategy, our proposed P-SD can evaluate a largernumber of promising candidates before termination, whileretaining efficient utilization of storage elements.

Before discussing the details of P-SD, let us define a candi-date group. Recall that the constellation of -ary QAM canbe divided into a few subsets (namely, shown inFig. 5) of all the alphabets with an identical distance from areference alphabet. When the reference alphabet itself is con-sidered as a subset , it follows that has four alphabetsand a distance of one, has four alphabets and a distanceof , has four alphabets and a distance of two, haseight alphabets and a distance of , and has four alpha-bets and a distance of . Notice that the normalization factoris chosen to simplify the calculation of , without loss ofgenerality. Then, it readily follows that there are 3 possible sub-sets, , for QPSK (as exemplified in Fig. 5), 10 pos-sible subsets, , for 16-QAM, and 33 possible sub-sets, , for 64-QAM. Based on these subsets, a can-didate group can be defined as a group of partial vector symbols(in the second level) of which the alphabets in each level belongto an identical subset. Therefore, a candidate group can be rep-resented by a vector symbol of subsets. Accordingly,when the closest nodes in the tree (represented by shaded nodesin the previous figures) are chosen as the reference alphabets,a candidate group is nothing but a group of candidates with anidentical probability of transmission. In the previous example,a candidate group, , consists of two partial vectorsymbols, and , or equivalently, eight candidates de-rived from them, , , , , ,

, , and . Note that and havean identical probability of transmission, i.e., , asshown in Fig. 3.

Now, we are ready to describe the tree traversal of ourproposed P-SD in detail. As mentioned before, the proposedP-SD can be seen as a special case of DF-SD of which the treetraversal is confined into a few promising candidate groups.Fig. 6 shows the tree traversal and corresponding utilizationof storage elements of P-SD with four promising candidate

Authorized licensed use limited to: Ericsson Research. Downloaded on March 11, 2009 at 10:45 from IEEE Xplore. Restrictions apply.

690 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 56, NO. 3, MARCH 2009

Fig. 6. Example of tree traversal of proposed P-SD (� � �, � � �).(a) Search tree. (b) Storage elements. For clarity, the tree pruning and radiusreduction are not shown.

groups of , , , and. As shown in Fig. 3, these candidate groups have

higher probabilities of transmission than the other candidategroups; specifically, is smaller than or equal to two.The other candidate groups indicated by “ ” marks in thefigure, the less promising candidate groups, are pruned in thetree. Interestingly, as shown in Fig. 6, the order of evaluationof candidates is somewhat different from that of the proba-bilistic-search algorithm; in other words, the tree traversal ofP-SD is not performed in a probabilistic manner: The can-didates are evaluated in the order of ,

, ,and in the proposed P-SD, while theyare in the order of , , ,and in the probabilistic-search algorithm. Never-theless, it is easily found that the proposed P-SD can evaluatea larger number of promising candidates before termination bysimply pruning the less-promising candidates. All the candidategroups with smaller than or equal to two can be success-fully evaluated within 11 clock cycles as shown in Fig. 6, justas in the probabilistic-search algorithm. Undoubtedly, the cor-responding utilization of storage elements is as efficient as thatof the conventional DF-SD, since the tree traversal basicallyfollows the DF strategy as mentioned earlier. In other words,the required number of storage elements is always kept as lowas . The problems that remain to be solved arehow to select a predetermined number of promising candidategroups and how to confine the tree traversal to the selectedcandidate groups.

The first problem, how to select a predetermined number ofpromising candidate groups, is considered here. As previouslyshown in (15), the promising candidate groups are chosen basedon the channel coefficients available to the receiver. Denotingthe number of promising candidate groups by , the numberof subsets to be considered is always equal to , since anycandidate groups with subsets larger than (i.e., largerin terms of distances) do not need to be considered (note thatthe promising candidate groups always consist of subsetssmaller than or equal to , since always increaseswith the distances of subsets). Now, the problem is equivalentto choosing the candidate groups that minimize among

possible candidate groups (note that a candidate groupconsists of subsets). This problem can be efficiently

Fig. 7. Tree search for selecting promising candidate groups (� � �, � �

�). The numbers inside the circles in the �th level represent the squared distancesof the corresponding subsets � .

solved by a tree search with levels and nodes in eachlevel. From (15), can be recursively calculated as

(17)where is a partial vector symbolof distance in the th level. Additionally, isset to zero, and is nothing but . In this paper, thetree search is performed using the so-called -best algorithm[15]–[17]. Specifically, each node in the th level isvisited as a parent node, and its child nodes are evaluated. Aftervisiting all the nodes from to , partialvector symbols in the th level that minimize areobtained. This procedure is repeated from to ,as shown in (17). However, as opposed to standard -bestalgorithms, the tree search always guarantees the optimality ofselected candidate groups, since is not jointly evaluatedover multiple antennas, as shown in (15). Fig. 7 shows anexample of the tree search with nine candidate groups ( ,and thus, ), for four transmit and receive antennas

and 64-QAM . In this figure, thetop level represents the fourth level , while the bottomlevel represents the second level . The numbers insidethe circles in the th level represent the squared distances ofthe corresponding subsets . To be specific, for ,

for , for , for , for, for , for , for , and

for . Here, note that the number of child nodesvaries with the corresponding parent node: Nine child nodesfrom , four child nodes from , three child nodes from

, two child nodes from and , and a single childnode from the four remaining subsets. It implies that only 24partial vector symbols (among 81 partial vector symbols) needto be evaluated in the second and third levels. For example,

in the third level does not need to be evaluated,since nine more promising candidate groups have already beenevaluated, e.g., , , ,

, , , ,, and . This significantly simplifies

the application of -best algorithm. Furthermore, sincein (17) simply increases with the distance of subset, the treetraversal can be performed without any sorting. Therefore, the

Authorized licensed use limited to: Ericsson Research. Downloaded on March 11, 2009 at 10:45 from IEEE Xplore. Restrictions apply.

PARK et al.: P-SD AND VLSI IMPLEMENTATION FOR MULTIPLE-ANTENNA SYSTEMS 691

computational complexity of selecting promising candidategroups is even lower than that of standard -best algorithm.

Once a predetermined number of promising candidate groupsare selected, the second problem is on how to confine tree tra-versal into them. Each time the backward recursion of tree tra-versal is performed, it is necessary to determine whether thepreferred child nodes in the upper levels belong to the selectedcandidate groups. If there exists any preferred node belongingto the selected candidate groups, it is chosen as the next parentnode, just as in the conventional DF-SD. Otherwise, the pre-ferred node is simply ignored as if it is pruned in the tree. Forthis reason, it is essential to determine which subset the pre-ferred node belongs to, since a candidate group is representedby a vector of subsets. Fortunately, this can be easily done if thetree traversal follows the SE ordering, which is often adoptedin most of the conventional DF-SD architectures [11]–[13]. Theidea behind this is as follows: The distance incrementin (9) usually tends to increase with the distance of subsets,since the closest node that minimizes is chosen as thereference alphabet. Recalling that the SE ordering inherentlyvisits the nodes in a descending order of , it is possible toconfine the search into the selected candidate groups by simplykeeping track of the number of visited child nodes derived fromeach node. In the previous example of Fig. 6, up to three childnodes derived from need to be visited as a parent node,since and are selected as promisingcandidate groups. Consequently, as shown in Fig. 6, isnot chosen as the next parent node at the fifth clock cycle, sincethree child nodes , , and have already beenvisited according to the SE ordering.

Before going to Section VI, it is worth to compare theproposed P-SD with the so-called maximum-first (MF) strategy[19]–[21]. The MF strategy efficiently schedules the maximumnumber of clock cycles within a block of consecutive vectorsymbols, based on the accumulated number of clock cycles andthe number of remaining vector symbols. Hence, it significantlyimproves the error performance of early termination, whenconsecutive vector symbols experience independent fading,i.e., when the required number of clock cycles fluctuates withina block. However, the performance gain is much affected bythe fading dependence and becomes trivial in the absence ofindependent fading. On the other hand, P-SD allocates compu-tation resources within a vector symbol (by confining the treesearch into promising candidate groups), whereas MF strategyallocates them across consecutive vector symbols (by adjustingthe number of clock cycles). Thus, it is readily understood thatthe proposed P-SD improves the error performance of early ter-mination independently of the fading environment. Accordingto simulation results, it turns out that P-SD outperforms MFstrategy in slow fading channels, while MF strategy performsbetter than P-SD in fast fading channels, which is not presentedhere due to space limitation.

VI. EVALUATION OF ERROR PERFORMANCE

In this section, the error performance of P-SD is evaluated andcompared with conventional DF-SD for LDPC-coded MIMO-OFDM systems, relying on the simulation results. Here, fourtransmit and receive antennas , 64-QAM with

Fig. 8. Error probability of conventional DF-SD with early termination (� �� � �,� � ��,� � ��). ��-���� denotes DF-SD with the allow-able number of clock cycles � .

gray encoding , OFDM modulation with 64 subcar-riers, and a (1944, 972) irregular LDPC code are assumed, eventhough the proposed P-SD is generally applicable to differentnumbers of antennas and different modulation orders as in [11]and [13] (for the detailed information of system parameters usedin our simulations, refer to [25]). The multipath fading channelsare assumed to be frequency-selective (with a delay spread of250 ns) and spatially uncorrelated due to rich scattering. The listof SD is assumed to contain 16 candidates (whichwas empirically determined, based on our simulations and otherreferences including [6] and [15]), and the belief propagationalgorithm (the so-called sum–product decoding) [26] with teniterations is adopted for the subsequent decoding. The targetbit-error rate (BER) is set to . In the remainder of this paper,the system parameters are assumed as given in this section, un-less mentioned otherwise.

In Fig. 8, the error performance of conventional DF-SDin [11]–[13] is evaluated. Here, - denotes DF-SDwith the allowable number of clock cycles . For instance,

- corresponds to the simplest case where the treesearch is done over only candidates (including the nullingand cancellation solution), while - correspondsto the ideal case where the tree search is not terminated im-properly, i.e., all possible candidates are evaluated. Whenis equal to 50, the error performance of DF-SD is significantlydegraded from the ideal case, - , to be specific, byan SNR of 13 dB. As increases up to 100 and 200, the errorperformance is improved by 9 and 10 dB, respectively. Notethat, even when is as high as 200, the performance loss fromthe ideal case is nontrivial (about 3 dB), as shown in the figure.

In Fig. 9, the error performance of P-SD is evaluated, whenthe number of promising candidate groups is given as ,3, 9, 20, and 100. Here, - denotes P-SD with twoparameters and . The figure shows that - isthe best choice in terms of error performance. Generally, asdecreases from this optimum value (up to one), the error per-formance of - tends to approach that of - ,

Authorized licensed use limited to: Ericsson Research. Downloaded on March 11, 2009 at 10:45 from IEEE Xplore. Restrictions apply.

692 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 56, NO. 3, MARCH 2009

Fig. 9. Error probability of P-SD with different numbers of candidate groups(� � � � �, � � ��, � � ��). �-������ denotes P-SD withtwo parameters � and � .

since the tree search is further confined into the candidatesaround the nulling and cancellation solution. On the other hand,as increases from this optimum value, the error performanceof - tends to approach that of - . Thisis not surprising at all, since a sufficiently large number ofpromising candidates make the tree traversal similar to thatof DF-SD and not that of probabilistic-search algorithm. Inaddition, it is clearly shown in the figure that our proposedP-SD, i.e., - , significantly outperforms -by 10 dB. Noticing that the error performance of -is attained by - , it is concluded that our proposedP-SD significantly reduces the required number of clock cycles,i.e., by a factor of four. Fig. 9 also shows that the performancedegradation due to early termination can be successfully com-pensated: The error performance of - approachesthat of ideal DF-SD within 3 dB, without further increasing thenumber of clock cycles. In the remainder of this paper, isassumed to be the optimum value, i.e., .

Before concluding this section, let us consider a nonadaptivealternative of the proposed P-SD that selects the promisingcandidate groups based on the average power of channelcoefficients , instead of the instan-taneous channel coefficients . Recallingthe well-known statistical property in (16), the promising can-didate groups can be selected without estimating the channelcoefficients. This statistical alternative is advantageous overthe proposed P-SD in that the computation for preprocessingstage can be saved. As shown in Fig. 10, however, the errorperformance of this alternative is even worse than that of theoriginal P-SD. Here, - denotes a statistical versionof P-SD with two parameters and . As a consequence,it easily follows that, in spite of additional computation forpreprocessing, the P-SD performs better when the promisingcandidate groups are selected based on the instantaneouschannel coefficients rather than their statistics (e.g., averagepower). Hence, it can be said that the adaptive control of DF

Fig. 10. Error probability of nonadaptive alternatives to P-SD (� � � ��,� � ��,� � ��). ������� denotes a statistical version of P-SDwith two parameters � and � .

search strategy (by confining the search into channel-depen-dently selected candidate groups) leads to the improved errorperformance in our proposed P-SD.

VII. PROPOSED VLSI ARCHITECTURE

The VLSI architecture suitable for P-SD is proposed inthis section. As shown in Fig. 11, the hardware for P-SDconsists of several functional blocks. The operation of thesefunctional blocks can be described by three stages, in otherwords, , , and

. In , a prede-termined number of promising candidate groups are selected

based on the channel coefficients. Since thewireless channel changes slowly (compared with data trans-mission), is activated only when thechannel is significantly changed, or the preamble is periodicallyreceived. In , all the child nodes derivedfrom a parent node are evaluated , and the nextparent node is chosen . In , theintermediate calculations are stored in the forward recursion

, the preferred nodes are newly chosen ( or), and the list (and therefore, the radius of sphere) is

updated . Hence, the proposed VLSI architecture ofP-SD is characterized by two-stage pipeline architecture, whichenables the one-node-per-cycle operation [12]. Additionally,in the proposed architecture, the intermediate calculations arestored and revisited in the form of instructions, which consist oftheir corresponding partial vector symbols and partial distances,as depicted in [11]. In Fig. 11, the thick arrows indicate theinstructions, for example, represents an instructionfor the next parent node. The detailed description of eachfunctional block is presented next.

1) calculates the instructions corresponding tochild nodes from a parent node , based on areceived vector symbol and an instruction corre-sponding to the parent node . The hard-ware implementation for calculating these instructions is

Authorized licensed use limited to: Ericsson Research. Downloaded on March 11, 2009 at 10:45 from IEEE Xplore. Restrictions apply.

PARK et al.: P-SD AND VLSI IMPLEMENTATION FOR MULTIPLE-ANTENNA SYSTEMS 693

Fig. 11. VLSI architecture for proposed P-SD. The hardware for P-SD consists of several functional blocks, and their operation can be described by three stages,in other words, ����������� �� �, ������� �� � �, and ������� �� � ��.

shown in Fig. 12, in the case of 64-QAM. In addition tothe calculation of instructions, sorts the calcu-lated instructions in terms of their distances. This enablesthe one-node-per-cycle operation by reducing the compu-tational complexity of the subsequent calculations, for ex-ample, the update of a list of candidates. Generally, to sortnumbers in an ascending (or descending) order is compu-tationally prohibitive. In this case, however, the instruc-tions can be easily sorted, since their distances are depen-dent on the position of in the 2-D space, as shown in(13). For example, in the case of 64-QAM, by dividing the2-D space into 256 rectangular regions and finding whichrectangular region belongs to, the instructions can beeasily sorted in an ascending order of partial distance, asshown in Fig. 13: The instructions are sorted in the order of111101, 110101, 111111, 110111, and so on. The positionof is easily found by finding which rectangular region

belongs to the first quadruple.2) controls the tree traversal according to the DF

search strategy shown in Fig. 2. To be specific, it chooseseither forward or backward recursion for thenext clock cycle and generates an instruction for the nextparent node , based on the calculated instruc-tions or the instructions corresponding to thepreferred nodes .

3) stores the remaining calculated in-structions onto the pipeline register

if the tree traversal is in forward recursion.It also stores an instruction corresponding to the preferredchild node among the calculatedinstructions if the child node belongs to the selectedpromising candidate groups (i.e., ).

Fig. 12. Hardware implementation for calculating the partial distances in thecase of 64-QAM. The� partial distances corresponding to child nodes from aparent node are calculated in parallel.

4) selects and stores the preferred instructions, among the instructions stored previously

(in forward recursion) , if the tree traversalis in backward recursion. Just as in , the preferredinstructions are stored only if the corresponding nodebelongs to the selected promising candidate groups (i.e.,

).5) updates a list of candidates based on

the calculated instructions and, subsequently,updates the radius .

Authorized licensed use limited to: Ericsson Research. Downloaded on March 11, 2009 at 10:45 from IEEE Xplore. Restrictions apply.

694 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 56, NO. 3, MARCH 2009

Fig. 13. Sorting of calculated partial distances in the case of 64-QAM. By di-viding the 2-D space into 256 rectangular regions and finding which rectangularregion � belongs to, the instructions can be easily sorted in an ascendingorder of partial distance.

6) selects a predetermined number of candidategroups as preprocessing, based on the channel co-efficients estimated from the received preamble .

7) controls the tree traversal based on the prob-abilistic-search algorithm by determining whether each ofthe child nodes belongs to the selected promising candidategroups and letting and knowit . This is done by simply keeping track of thenumber of visited child nodes for each node, as mentionedin Section V.

The proposed VLSI architecture is quite similar to thatproposed in [11] and [12], since the proposed P-SD is thesame as DF-SD except that the tree traversal is based on theprobabilistic-search algorithm as well as the DF search strategy,as aforementioned in Section V. From the perspective of hard-ware architecture, P-SD can be implemented by simply addingtwo functional blocks and to theconventional DF-SD: The other functional blocks ( ,

, , , , and ) arecommon to the conventional DF-SD. Therefore, it is readilyunderstood that most of the state-of-the-art architectures forDF-SD (e.g., [11], [12]) are also applicable to the P-SD. This isone of the advantages of our proposed SD.

Whenever the nodes at the bottom of tree are evaluated,updates a list of candidates, based on the corre-

sponding distances. To be specific, candidates are newlychosen so as to minimize the distances among evaluatedcandidates and previously stored candidates. Since thecalculated instructions are already sorted in terms of theirdistances, the question is on how to merge two separatelysorted sets, a set of distances of evaluated candidatesand the other set of distances of previously storedcandidates. How to merge two separately sorted sets into a

Fig. 14. Merging of two separately sorted sets into a new sorted set. The re-quired number of comparisons is approximately proportional to � , when � islarger than �.

Fig. 15. Hardware implementation for merging two separately sorted sets(� � �, � � ��). All the necessary comparisons are performed in parallel toenable one-clock-cycle operation.

new sorted set is described as shown in Fig. 14. The pseu-docodes in Fig. 14 (left side) show how to generate a setof sorted numbersfrom two separately sorted sets, a set of sorted num-bers , and a set ofsorted numbers . Forexample, when and are equal to 4 and 16, respectively,Fig. 14 (right side) shows how to generate , thesixth number of the set, and the corresponding hardware im-plementation is described as shown in Fig. 15. One of ninenumbers ( , , , , and

) is chosen asbased on 16 comparisons between two sorted sets. Note thatall the necessary comparisons are performed in parallel toenable one-clock-cycle operation in our VLSI architecture.From the pseudocodes shown in Fig. 14, it follows that therequired number of comparisons is roughly proportional toif is larger than (whereas it is proportional to ifis even smaller than ). This sorted–set merging is used notonly in updating the list of candidates ( withand ) but also in selecting the promising candidategroups ( with and ). A similarapproach has been introduced in [16], and a modified algorithm

Authorized licensed use limited to: Ericsson Research. Downloaded on March 11, 2009 at 10:45 from IEEE Xplore. Restrictions apply.

PARK et al.: P-SD AND VLSI IMPLEMENTATION FOR MULTIPLE-ANTENNA SYSTEMS 695

Fig. 16. Hardware implementation for calculating ����. � �� � is recur-sively calculated based on � �� �, according to (17).

has shown the possibility of efficient hardware implementation[13].

Now, let us describe that is unique toour proposed P-SD. Based on the channel coefficients

, this functional block selects thepromising candidate groups that minimize . As afore-mentioned in Section V, this can be efficiently performed byutilizing a simplified version of -best algorithm [15]–[17].Each node in the th level is visited as a parent node,and all the child nodes from a parent node (in the th level) areevaluated within a single clock cycle (i.e., one-node-per-cycleoperation). The corresponding hardware implementation isdescribed as shown in Fig. 16. In addition, the evaluated partialvector symbols and previously stored partial vector symbolsin the level are merged into partial vector symbols in thesubsequent clock cycle, as shown in Fig. 14. Consequently, itis easily understood that the selection of promising candidategroups takes only 19 clock cycles: A single clock cycle forthe fourth level and nine clock cycles for the second and thirdlevels.

The proposed VLSI architecture for four transmit and re-ceive antennas and 64-QAMwas modeled in Verilog HDL and synthesized with SynopsysDesign Compiler using a 0.25- m CMOS technology. Theprelayout simulation results were bit-accurately verified, andthe maximum clock frequency of 51 MHz was verified usingprelayout static timing analysis. The synthesis results show thatthe total hardware area amounts to 270.6 kilo gate-equivalents(kGE), and the additional functional blocks, and

, occupy only a small fraction, i.e., 3.8% (10.3 kGE),as shown in Fig. 17. The synthesis results also show that thecritical path of 19 ns exists in and , whichranges from to throughin . Note that belongs to

that is independent of the critical path,and has a small path delay of 13 ns and a smallhardware area of 1.2 kGE (0.4%) in .Therefore, recalling that P-SD can be implemented by simplyadding these two functional blocks to DF-SD (e.g., [11], [12]),any state-of-the-art architectures of DF-SD that promise smallhardware area and high clock frequency can be readily ex-ploited in the proposed VLSI architecture of P-SD.

Before going to Section VIII, let us consider the requiredhardware complexity of our proposed SD, when the hardwarethroughput is given by a wireless standard. In order to achievea hardware throughput in vector symbols per second with a

Fig. 17. Estimated hardware areas of functional blocks (0.25-�m CMOS tech-nology). The total hardware area amounts to 270.6 kGE and the maximum clockfrequency is 51 MHz.

clock frequency in hertz, multiple hardware units may needto operate in parallel, and the required number of hardware unitsis given by [11]. Consequently, the saving of clockcycles can be seen as that of hardware complexity. For example,to achieve a hardware throughput of 12 mega-vector-symbolsper second given by IEEE 802.11n [25], the required number ofhardware units is equal to , assuming a clock frequencyof 50 MHz. Therefore, as shown in the previous section, a re-quired SNR of 25 dB is achievable at the expense of 12 hardwareunits in P-SD and 48 hardware units inDF-SD, implying that the proposed P-SD provides the saving of36 hardware units (75%).

VIII. COMPARISON: PERFORMANCE-COMPLEXITY TRADEOFFS

In this section, the tradeoffs between performance and com-plexity are presented to verify the advantage of our proposedP-SD. Even though there have been several reports in the liter-ature [10], [12], [27], relatively less attention has been paid torigorous comparison of performance and complexity in the pres-ence of early termination [19]–[21]. Furthermore, the perfor-mance evaluation is highly dependent on both the system param-eters [e.g., the number of antennas ( , ) and modulationorder ] and the channel parameters (e.g., delay spread andmaximum Doppler frequency), whereas the complexity anal-ysis is significantly affected by the choice of VLSI architectures(e.g., pipelining and parallel processing). Since a rigorous anal-ysis of performance and complexity of SD is an independentresearch topic, the focus of this section is on the comparisonof the proposed probabilistic-search strategy with conventionalDF and breadth-first search strategies. Specifically, the perfor-mance is measured by the required SNR for a BER of ,whereas the complexity is measured by the required compu-tational complexity of tree traversal. Note that since the hard-ware complexity such as hardware area and power consump-tion highly depends on the VLSI architecture, the computationalcomplexity that is independent of hardware architecture is oftenconsidered as a more reasonable measure of complexity. Ad-ditionally, note that the preprocessing such as QR decomposi-tion and selection of promising candidate groups, which is car-ried out even less frequently (e.g., when channel coefficientsare changed), is not considered in analyzing the computational

Authorized licensed use limited to: Ericsson Research. Downloaded on March 11, 2009 at 10:45 from IEEE Xplore. Restrictions apply.

696 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 56, NO. 3, MARCH 2009

TABLE ICOMPUTATIONAL COMPLEXITY OF SD

complexity, since it requires only a small fraction of computa-tion. For the purpose of comparison, conventional SD methods,DF-SD in [11]–[13], and -best algorithm in [16], [17], areconsidered here. However, since we are focused on the com-parison of tree-search strategies, algorithmic modifications pro-posed in [12], [13], [16], [17] are not considered here.

To facilitate rigorous comparison, the tree traversal of SDis categorized into distance calculation, sorted–set merging,memory access (including write and read operations), sorting,and minimum search. The analysis of computational com-plexity is summarized in Table I, where denotes the numberof clock cycles in the th level, and therefore, it follows that

and .Assuming the one-node-per-cycle operation as in [12], [13],

[16], the computational complexity of distance calculation issimply proportional to the number of clock cycles , as shownin Table I. Here, denotes the computational complexityof distance calculation for -ary QAM. As readily understoodfrom Fig. 12, the distance calculation includesadditions and 2 multiplications for 16-QAM andadditions and 2 multiplications for 64-QAM for each clockcycle. In more detail, the computational complexity amounts to

for 16-QAM andfor 64-QAM, where , ,

, , and denote the computational complexityof addition, full multiplication, constant-coefficient multi-plication, variable-coefficient multiplication, and squaring,respectively.

Two separately sorted sets are merged into a new sorted set,based on their distances. While DF-SD and P-SD mergeevaluated candidates and previously stored candidates inthe first level to update the list, -best algorithm additionallymerges evaluated partial vector symbols and previouslystored ones in the second level, and evaluated partial vectorsymbols and previously stored ones in the third level.In Table I, denotes the computational complexityof sorted–set merging with two parameters and . Notethat the sorted–set merging is dominated by the number ofcomparisons. Consequently, the computational complexity ap-proximately amounts to 21 comparisons for ,136 comparisons for , and 2080 comparisonsfor .

Fig. 18. Tradeoffs between performance and complexity (1): Distance calcu-lation. The computational complexity is measured by the required number ofclock cycles.

With respect to memory access, DF-SD and P-SD requirewrite operations and read operations of can-didates to update a list and write/read operations of

partial vector symbols (i.e., either write operations for for-ward recursion or read operations for backward recursion). Onthe other hand, -best algorithm performs write operationsand read operations of candidates to update thelist, write operations and read operations of partialvector symbols in the second level, write operations andread operations of partial vector symbols in the third level,and a write operation and a read operation of partial vectorsymbols in the fourth level. In Table I, denotes the compu-tational complexity of a single write/read operation of vectorsymbols.

While DF-SD and P-SD require the sorting of evaluatedcandidates only in the first level to update a list, -best algo-rithm requires the sorting of evaluated partial vector symbolsat every level. In Table I, denotes the computational com-plexity of sorting numbers.

DF-SD and P-SD search the minimum distance in order tofind a preferred child node among yet-to-be-visited nodesfor the SE ordering, as opposed to -best algorithm. In Table I,

denotes the computational complexity of minimum searchfor numbers.

Based on earlier analysis, the tradeoffs between performanceand complexity are evaluated with respect to distance cal-culation, sorted–set merging, and memory access, as shownin Figs. 18–20, respectively. The number of clock cycles isgiven as follows: , 50, 60, 75, 100, 150, and 200 forDF-SD, , 30, 40, 50, 75, 100, 150, and 200 for P-SD,and , (6, 6, 16, 1),(6, 16, 16,1),{(16, 16}, 16, 1),(16, 16, 64, 1), (16, 64, 64, 1), and (64, 64,64, 1) for -best algorithm. Here, note that the expectation ofcomputational complexity is taken for both DF-SD and P-SD,since is not constant but randomly varying. It is shownthat the proposed P-SD is definitely advantageous in terms ofboth performance and complexity, when it is compared with

Authorized licensed use limited to: Ericsson Research. Downloaded on March 11, 2009 at 10:45 from IEEE Xplore. Restrictions apply.

PARK et al.: P-SD AND VLSI IMPLEMENTATION FOR MULTIPLE-ANTENNA SYSTEMS 697

Fig. 19. Tradeoffs between performance and complexity (2): Sorted–setmerging. The computational complexity is measured by the required numberof comparisons.

Fig. 20. Tradeoffs between performance and complexity (3): Memory access.The computational complexity is measured by the number of required write/readoperations.

DF-SD. Interestingly, the proposed P-SD is also advantageousover -best algorithm, when the ML error performance (i.e.,the required SNR of 22 dB) is desired, implying that theprobabilistic-search strategy can be considered as an alter-native to conventional DF and breadth-first search strategies.For instance, while -best algorithm requires 193 clock cy-cles (distance calculation), 283 248 comparisons (sorted–setmerging), and 18544 write/read operations (memory access)to approach the ML error performance, our proposed P-SDrequires 200 clock cycles, 3604 comparisons, and 13600write/read operations. Recalling that the ML detection canbe seen as DF-SD without radius reduction and SE orderingwith , it follows that thecomputational complexity amounts to 266305 clock cycles,

comparisons, and write/read operations.

IX. CONCLUSION

In this paper, the probabilistic-search algorithm was appliedto the conventional DF-SD to improve the error performance.Our proposed P-SD evaluates more promising candidatesearlier (before termination) by simply ignoring less promisingcandidate groups, while maintaining the hardware efficiency ofDF-SD. Based on the performance and complexity tradeoffs,it was clearly shown that the proposed P-SD is more efficientthan conventional SD such as DF-SD and -best algorithmwhen the ML error performance is desired. Since the proposedP-SD can be implemented by simply adding two functionalblocks, it can be generally applied to most state-of-the-artDF-SD architectures reported in the literature. This is one ofthe advantages of our proposed P-SD in that it can fully exploitany algorithmic and architectural improvements for DF-SD. Forexample, it can be combined with the MF strategy [19]–[21]:Once the maximum number of clock cycles has been scheduledfor a specific vector symbol, the optimum number of promisingcandidate groups is applied to confine the tree search into morepromising candidate groups. Furthermore, even though theproposed P-SD is targeted for LDPC-coded MIMO-OFDMsystems, it is readily applicable to many detection methodssuch as multiuser detection at cellular networks as well as linearconstellation precoding [28].

ACKNOWLEDGMENT

The authors would like to thank J. Lee and Y. Zhang for theirvaluable comments. They would also like to thank the anony-mous reviewers for their comments and suggestions.

REFERENCES

[1] I. E. Telatar, “Capacity of multiantenna Gaussian channels,” Eur.Trans. Commun., vol. 10, no. 6, pp. 585–595, Nov./Dec. 1999.

[2] D. Gesbert, M. Shafi, D. Shiu, P. J. Smith, and A. Naguib, “Fromtheory to practice: An overview of MIMO space-time coded wirelesssystems,” IEEE J. Sel. Areas Commun., vol. 21, no. 3, pp. 281–302,Apr. 1998.

[3] Enhanced Wireless Consortium (EWC) [Online]. Available:http://www.enhancedwireless-consortium.org/home

[4] The 3-rd Generation Partnership Project (3GPP)—Long Term Evolu-tion (LTE) [Online]. Available: http://www.3gpp.org/Highlights/LTE/LTE.htm

[5] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, “Optimal decoding of linearcodes for minimizing symbol error rate,” IEEE Trans. Inf. Theory, vol.IT-20, no. 2, pp. 284–287, Mar. 1974.

[6] B. M. Hochwald and S. ten Brink, “Achieving near-capacity on a mul-tiple-antenna channel,” IEEE Trans. Commun., vol. COM-51, no. 3, pp.389–399, Mar. 2003.

[7] H. Vikalo, B. Hassibi, and T. Kailath, “Iterative decoding for MIMOchannels via modified sphere decoding,” IEEE Trans. WirelessCommun., vol. 3, no. 6, pp. 2299–2311, Nov. 2004.

[8] B. Hassibi and H. Vikalo, “On the sphere-decoding algorithm I. Ex-pected complexity,” IEEE Trans. Signal Process., vol. 53, no. 8, pp.2806–2818, Aug. 2005.

[9] J. B. Anderson and S. Mohan, “Sequential coding algorithms: A surveyand cost analysis,” IEEE Trans. Commun., vol. COM-32, no. 2, pp.169–176, Feb. 1984.

[10] A. D. Murugan, H. E. Gamel, M. O. Damen, and G. Caire, “A unifiedframework for tree search decoding: Rediscovering the sequential de-coder,” IEEE Trans. Inf. Theory, vol. 52, no. 3, pp. 933–953, Apr. 2006.

[11] D. Garrett, L. Davis, S. ten Brink, B. Hochwald, and G. Knagge,“Silicon complexity for maximum likelihood MIMO detection usingspherical decoding,” IEEE J. Solid-State Circuits, vol. 39, no. 9, pp.1544–1552, Sep. 2004.

Authorized licensed use limited to: Ericsson Research. Downloaded on March 11, 2009 at 10:45 from IEEE Xplore. Restrictions apply.

698 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 56, NO. 3, MARCH 2009

[12] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H.Boelcskei, “VLSI implementation of MIMO detection using the spheredecoding algorithm,” IEEE J. Solid-State Circuits, vol. 40, no. 7, pp.1566–1577, Jul. 2005.

[13] M. Wenk, A. Burg, M. Zellweger, C. Studer, and W. Fichtner, “VLSIimplementation of the list sphere algorithm,” in Proc. 24th NorchipConf., 2006, pp. 107–110.

[14] Z. Guo and P. Nilsson, “A VLSI architecture of the Schnorr–Euchnerdecoder for MIMO systems,” in Proc. IEEE CAS Symp. EmergingTechnol., Jun. 2004, pp. 65–68.

[15] H. Kawai, K. Higuchi, N. Maeda, M. Sawahashi, T. Ito, Y. Kakura, A.Ushirokawa, and H. Seki, “Likelihood function for qrm-mld suitablefor soft-decision turbo decoding and its performance for ofcdm MIMOmultiplexing in multipath fading channel,” IEICE Trans. Commun.,vol. E88-B, no. 1, pp. 47–57, Jan. 2005.

[16] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, “K-bestMIMO detection VLSI architectures achieving up to 424 MBPs,” Proc.IEEE ISCAS, pp. 1151–1154, May 2006.

[17] S. Chen, T. Zhang, and Y. Xin, “Relaxed K-best MIMO signal detectordesign and VLSI implementation,” IEEE Trans. Very Large Scale In-tegr. (VLSI) Syst., vol. 15, no. 3, pp. 328–337, Mar. 2007.

[18] L. G. Barbero and J. S. Thompson, “Performance analysis of a fixed-complexity sphere decoder in high-dimensional MIMO systems,” inProc. Int. Conf. Acoust., Speech Signal Process., Toulouse, France,May 2006, vol. 4, pp. 557–560.

[19] A. Burg, M. Borgmann, M. Wenk, C. Studer, and H. Bolcskei, “Ad-vanced receiver algorithms for MIMO wireless communications,” inProc. ACM Des. Autom. Test Eur. Conf., Mar. 2006, pp. 593–598.

[20] A. Burg, M. Wenk, and W. Fichtner, “VLSI implementation ofpipelined sphere decoding with early termination,” in Proc. Eur.Signal Process. Conf., Florence, Italy, Sep. 2006.

[21] C. Studer, M. Wenk, A. Burg, and H. Bolcskei, “Soft-output spheredecoding: Performance and implementation aspects,” in Proc. 40thAsilomar Conf. Signals, Syst., Comput., Pacific Grove, CA, Nov. 2006,pp. 2071–2076.

[22] W. Zhao and G. B. Giannakis, “Reduced complexity closest pointdecoding algorithms for random lattices,” IEEE Trans. WirelessCommun., vol. 5, no. 1, pp. 101–111, Jan. 2006.

[23] C. P. Schnorr and M. Euchner, “Lattice basis reduction: Improved prac-tical algorithms and solving subset sum problems,” Math. Program.,vol. 66, no. 1–3, pp. 181–191, Aug. 1994.

[24] D. Shiu and J. M. Kahn, “Layered space-time codes for wirelesscommunications using multiple transmit antennas,” in Proc. Int. Conf.Commun., Jun. 1999, pp. 436–440.

[25] V. K. Jones et al., WWiSE IEEE 802.11n Proposal 2004, IEEE802.11-04/0935r3.

[26] J. Chen and M. P. C. Fossorier, “Near optimum universal belief propa-gation based decoding of low-density parity check codes,” IEEE Trans.Commun., vol. 50, no. 3, pp. 406–414, Mar. 2002.

[27] A. Burg, M. Borgmann, C. Simon, M. Wenk, M. Zellweger, and W.Fichtner, “Performance tradeoffs in the VLSI implementation of thesphere decoding algorithm,” in Proc. IEE 3G Mobile Commun. Conf.,Oct. 2004, pp. 93–97.

[28] Z. Liu, Y. Xin, and G. B. Giannakis, “Linear constellation precodingfor OFDM with maximum multipath diversity and coding gains,” IEEETrans. Commun., vol. 51, no. 3, pp. 416–427, Mar. 2003.

Chester Sungchung Park (S’02–M’06) was born inMinnesota. He received the B.S., M.S., and Ph.D. de-grees in electrical engineering from Korea AdvancedInstitute of Science and Technology, Daejeon, Korea,in 2000, 2002, and 2006, respectively.

During his doctoral studies, he was a VisitingResearcher at the University of Minnesota, Min-neapolis. From 2006 to October 2007, he was withSamsung Advanced Institute of Technology, Yongin,Gyeonggi, Korea, where he conducted researcheson modulation and coding for wireless modems and

high-density memories. Since November 2007, he has been a Senior StaffSystem Engineer with Ericsson Research, Research Triangle Park, NC. His re-search interests include communication algorithms and VLSI implementations,focusing on multiple-input–multiple-output orthogonal frequency-divisionmultiplexing and channel coding.

Keshab K. Parhi (S’85–M’88–SM’91–F’96) received the B.Tech. degree fromthe Indian Institute of Technology, Kharagpur, India, in 1982, the M.S.E.E. de-gree from the University of Pennsylvania, Philadelphia, in 1984, and the Ph.D.degree from the University of California, Berkeley, in 1988.

Since 1988, he has been with the University of Minnesota, Minneapolis,where he is currently a Distinguished McKnight University Professor with theDepartment of Electrical and Computer Engineering. His research addressesVLSI architecture design and implementation of physical-layer aspects ofbroadband communications systems. He is currently working on error controlcoders and cryptography architectures, high-speed transceivers, and ultrawide-band systems. He has authored over 400 papers, the text book VLSI DigitalSignal Processing Systems (Wiley, 1999), and coedited the reference bookDigital Signal Processing for Multimedia Systems (Marcel Dekker, 1999).

Dr. Parhi was the recipient of numerous awards including the 2004F. E. Terman Award by the American Society of Engineering Education,the 2003 IEEE Kiyo Tomiyasu Technical Field Award, the 2001 IEEEW. R. G. Baker Prize Paper Award, and a Golden Jubilee Award from theIEEE Circuits and Systems Society in 1999. He has served on the EditorialBoards of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR

PAPERS and IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: EXPRESS

BRIEFS, IEEE TRANSACTIONS ON VLSI SYSTEMS, IEEE TRANSACTIONS

ON SIGNAL PROCESSING, IEEE SIGNAL PROCESSING LETTERS, and IEEESIGNAL PROCESSING MAGAZINE. He served as the Editor-in-Chief of theIEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS from2004 to 2005 and serves on the Editorial Board of the Journal of VLSI SignalProcessing. He has served as Technical Program Cochair of the 1995 IEEEVLSI Signal Processing Workshop and the 1996 ASAP Conference and as theGeneral Chair of the 2002 IEEE Workshop on Signal Processing Systems. Hewas a Distinguished Lecturer for the IEEE Circuits and Systems Society from1996 to 1998.

Sin-Chong Park (M’97) received the B.S. degreein applied physics from Seoul National University,Seoul, Korea, in 1964, the M.S. degree in physicsfrom Purdue University, West Lafayette, IN, in1973, and the Ph.D. degree in chemical physics fromUniversity of Minnesota, Minneapolis, in 1978.

From 1978 to 1982, he was the Head of theLaser Physics Laboratory, Agency for DefenseDevelopment, Daejeon, Korea. From 1983 to 1997,he was the Executive Director of the SemiconductorDivision and the Vice President of the Electronics

and Telecommunications Research Institute, Daejeon. From 1993 to 1995, hewas a National Project Leader of GDRAM Research. Since 1998, he has beena Professor with the School of Engineering, Information and CommunicationsUniversity, Daejeon, where he serves as director of the System IntegrationTechnology Institute. He has authored or coauthored over 330 papers. Heis the holder of 37 patents. His research interests include next-generationcommunications and system-on-a-chip.

Authorized licensed use limited to: Ericsson Research. Downloaded on March 11, 2009 at 10:45 from IEEE Xplore. Restrictions apply.