

arXiv:1101.3698v1 [cs.AR] 17 Jan 2011

TO APPEAR IN JOURNAL OF COMMUNICATIONS AND NETWORKS

Systolic Arrays for Lattice-Reduction-Aided MIMO Detection

Ni-Chun Wang, Ezio Biglieri, and Kung Yao

Abstract—Multiple-input, multiple-output (MIMO) technology provides high data rates and enhanced QoS for wireless communications. Since the benefits from MIMO result in a heavy computational load in detectors, the design of low-complexity sub-optimum receivers is currently an active area of research. Lattice-reduction-aided detection (LRAD) has been shown to be an effective low-complexity method with near-ML performance. In this paper we advocate the use of systolic array architectures for MIMO receivers, and in particular we exhibit one of them based on LRAD. The "LLL lattice reduction algorithm" and the ensuing linear detections or successive spatial-interference cancellations can be located in the same array, which is considerably hardware-efficient. Since the conventional form of the LLL algorithm is not immediately suitable for parallel processing, two modified LLL algorithms are considered here for the systolic array. The LLL algorithm with full-size reduction (FSR-LLL) is one of the versions more suitable for parallel processing. Another variant is the all-swap lattice-reduction (ASLR) algorithm for complex-valued lattices, which processes all lattice basis vectors simultaneously within one iteration. Our novel systolic array can operate both algorithms with different external logic controls. In order to simplify the systolic array design, we replace the Lovász condition in the definition of an LLL-reduced lattice with the looser Siegel condition. Simulation results show that for LR-aided linear detections, the bit-error-rate performance is still maintained with this relaxation. Comparisons between the two algorithms in terms of bit-error-rate performance and average FPGA processing time in the systolic array are made, which show that ASLR is a better choice for a systolic architecture, especially for systems with a large number of antennas.

Index Terms—Lattice reduction, MIMO receivers, systolic arrays, wireless communications.

I. INTRODUCTION

MULTIPLE-INPUT, multiple-output (MIMO) technology, using several transmit and receive antennas in a rich-scattering wireless channel, has been shown to provide considerable improvement in spectral efficiency and channel capacity [1]. MIMO systems yield spatial diversity gain, spatial multiplexing gain, array gain, and interference reduction over single-input single-output (SISO) systems [2]. However, these benefits come at the price of a computational complexity

N.C. Wang, E. Biglieri, and K. Yao are with the Electrical Engineering Department, University of California-Los Angeles, Los Angeles, CA 90095, USA (Address: 56-125B Engineering IV Building, 420 Westwood Plaza, Los Angeles, CA 90095, USA; e-mail: [email protected]; [email protected]; [email protected]). The work of N.C. Wang was partially supported by the National Science Council, Taiwan (R.O.C.) (TMS-094-2-A-002). The work of K. Yao was partially supported by the NSF CENS program CCR-012, NSF grant EF-0410438, and NSF grant DBI-0754247.

E. Biglieri is also with the Departament de Tecnologies de la Informació i les Comunicacions, Universitat Pompeu Fabra, Barcelona, Spain. The work of E.B. was supported by the Spanish Ministry of Education and Science under Project CONSOLIDER-INGENIO 2010 CSD2008-00010 "COMONSENS".

of the detector that may be intolerably large. In fact, optimal maximum-likelihood (ML) detection in large MIMO systems may not be feasible in real-time applications, as its complexity increases exponentially with the number of antennas. Low-complexity receivers, employing linear detection or successive spatial-interference cancellation (SIC), are computationally less heavy, and amenable to simple hardware implementation [3]–[5]. However, the diversity and error-rate performance of these low-complexity detectors are not comparable to those achieved with ML.

Lattice-reduction-aided detection (LRAD), which combines lattice reduction techniques with linear detections or SIC, has been shown to yield some improvement on error-rate performance [6]–[8]. The Lenstra-Lenstra-Lovász (LLL) algorithm [9] is the most widely used lattice reduction algorithm, and can be applied to complex-valued lattices [10]. The performance of complex LLL-aided linear detection in MIMO systems was analyzed in [11]. LLL-based LRAD was also shown to achieve full receiver diversity [12]. It was also shown that LR-aided minimum mean-square-error decoding achieves the optimal diversity-multiplexing tradeoff [16]. When applied to MIMO detection, the average complexity of the LLL algorithm is polynomial in the dimension of the channel matrix (the worst-case complexity could be unbounded [13]). A fixed-complexity LLL algorithm, which modifies the original version to allow more robust early termination, has recently been proposed in [17]. In LRAD, the LLL algorithm need be performed only when the channel state changes. If the channel change rate is high, or a large number of channel matrices need be processed, such as in a MIMO-OFDM system, a fast-throughput algorithm and the corresponding implementation structure are needed for real-time applications. To obtain this, we first discuss two variants of the LLL algorithm, suitably modified for parallel processing. Second, we propose a novel systolic array structure implementing the two modified LLL algorithms and the ensuing detection methods.

A systolic array [18], [19] is a network of processing elements (PEs) which transfer data locally and regularly with nearby elements and work rhythmically. In Fig. 1(a), a simple two-dimensional systolic array is shown as an example. In this case, the matrix operation D = A·B + C is calculated by the systolic array, where A, B, C, and D are 2×2 matrices. The operation of each PE is shown in Fig. 1(b). The inputs of the systolic array, the entries of matrices A and C, are pipelined in a slanted manner for proper timing. Since all PEs can work simultaneously, the latency is shorter than with a single-processor system, and the results of D are output in parallel. Systolic algorithms and the corresponding systolic arrays have been designed for a number of linear algebra algorithms, such as matrix triangularization [20], matrix inversion [21], adaptive nulling [22], recursive least-squares [23], [24], etc. An overview of systolic designs for several computationally demanding linear algebra algorithms for signal processing and communications applications was recently published in [25]. While systolic arrays allow simple parallel processing and achieve higher data rates without the demand on faster hardware capabilities, the existence of multiple PEs implies a higher cost in circuit area. Thus, time efficiency is traded off with circuit area in hardware design. For the application we are advocating in this paper (MIMO detectors), systolic arrays offer an attractive solution, as we must cope with a high computational load while requiring high throughput and real-time operation. Systolic arrays have been previously suggested for MIMO applications. In [26], the authors proposed a universal systolic array for adaptive and conventional linear MIMO detectors. In [27], a reconfigurable systolic array processor based on the coordinate rotation digital computer (CORDIC) [28] is proposed to provide efficient MIMO-OFDM baseband processing. Also, matrix factorization and inversion are widely used in MIMO detection, with systolic arrays used to increase the throughput [5], [29].
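The local PE behavior of Fig. 1(b) (a_out = a_in, c_out = c_in + a_in·b) and the slanted input schedule can be checked with a small cycle-by-cycle simulation. The following Python sketch is ours, not from the paper: B is held stationary in the PEs while the columns of A and C are fed in with a one-cycle skew per index.

```python
import numpy as np

def systolic_matmul_add(A, B, C):
    """Cycle-by-cycle simulation of the Fig. 1 array computing D = A*B + C.

    Each PE (k, j) stores b[k, j], forwards its 'a' input to the right
    unchanged (a_out = a_in), and accumulates the partial result moving
    down its column (c_out = c_in + a_in * b)."""
    A, B, C = (np.asarray(M, dtype=float) for M in (A, B, C))
    n = A.shape[0]
    a_reg = np.zeros((n, n))   # 'a' value registered at PE (k, j)
    c_reg = np.zeros((n, n))   # partial sum registered at PE (k, j)
    D = np.zeros((n, n))
    for t in range(3 * n):     # enough cycles to fill and drain the pipeline
        new_a, new_c = np.zeros((n, n)), np.zeros((n, n))
        for k in range(n):
            for j in range(n):
                if j == 0:     # left edge: column k of A, skewed by k cycles
                    i = t - k
                    a_in = A[i, k] if 0 <= i < n else 0.0
                else:
                    a_in = a_reg[k, j - 1]
                if k == 0:     # top edge: column j of C, skewed by j cycles
                    i = t - j
                    c_in = C[i, j] if 0 <= i < n else 0.0
                else:
                    c_in = c_reg[k - 1, j]
                new_a[k, j] = a_in                   # a_out = a_in
                new_c[k, j] = c_in + a_in * B[k, j]  # c_out = c_in + a_in*b
        a_reg, c_reg = new_a, new_c
        for j in range(n):     # d[i, j] leaves PE (n-1, j) at t = i + j + n - 1
            i = t - j - (n - 1)
            if 0 <= i < n:
                D[i, j] = c_reg[n - 1, j]
    return D
```

The last entry of D emerges at cycle 3n − 3, which illustrates both the short latency and why the skewed ("slanted") input timing is needed for operands to meet the right PE at the right cycle.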

In this paper, our objective is to provide a novel systolic array design for LLL-based LRAD. The ideas are described from a system-level perspective instead of a detailed discussion of hardware-oriented issues. The system model and how LRAD works are briefly described in Section II. Since the original LLL algorithm [8]–[15] is not designed for parallel processing, and hence is not suitable for systolic design, two modified LLL algorithms are considered here (Section III). Note that we are not claiming the two algorithms work better than the original LLL in terms of the LRAD bit-error-rate (BER) performance. First, we improve on the format of the conventional LLL algorithm by altering the flow of the size-reduction process (we call it "LLL with full size-reduction," or FSR-LLL). FSR-LLL is more time-efficient in parallel processing than the conventional format, and hence suitable for systolic design. We also consider a variant of the LLL algorithm called "all-swap lattice reduction (ASLR)," which was first proposed in [30] for real lattices, and derive its complex-number version. A crucial difference between ASLR and the LLL algorithm is that with ASLR all lattice basis vectors are simultaneously processed during a single iteration. In both algorithms, in order to simplify the systolic array operations we replace the original Lovász condition [9] of the LLL algorithm with the slightly weaker Siegel condition [31]. Surprisingly, for LR-aided linear detections the BER performance with the Siegel condition under the proper parameter setting is just as good as the one using the Lovász condition. However, for LR-aided SIC, the performance with the Lovász condition is still slightly better due to less error propagation. The mapping from algorithm to systolic array is introduced in Section IV. In our design, ASLR and FSR-LLL can be operated on the same systolic-array structure, but an external logic controller is also required to control the algorithm flow.
Additionally, since ASLR was originally designed for parallel processing, a systolic array running ASLR is on the average more efficient than one running FSR-LLL. Simulation results also show that ASLR-based LRAD has a BER performance very similar to that of the LLL algorithm. A comparison between our proposed design and the conventional LLL in an FPGA implementation shows that the systolic arrays do provide faster processing speed with a moderate increase of hardware resources. After the channel matrix has been lattice-reduced, linear detectors or SIC can also be implemented by the same systolic array without any extra hardware cost, which is discussed in Section V.

Fig. 1. (a) Two-dimensional systolic array performing the matrix calculation D = A·B + C, where a_{ij}, b_{ij}, c_{ij}, d_{ij} are the (i,j) entries of the matrices A, B, C, and D. (b) The operation of each processing element: a_out = a_in, c_out = c_in + a_in·b.

Fig. 3. The empirical cumulative probability functions of the orthogonality defect for the 4×4 channel matrices under three different reduction algorithms (LLL, ASLR, and FSR-LLL with δ = .51, .75, .99; "no reduction" shown for reference).

Fig. 4. (a) The systolic array for the linear LRAD of a 4×4 MIMO system. (b) The operations of the diagonal and off-diagonal cells in the systolic array. ("*" is an indicator bit used to control the flow of the algorithm, as explained in Section IV.A.)

The following notation is used throughout the remaining sections. Capital bold letters denote matrices, and lower-case bold letters denote column vectors. For example, X = [x_1, x_2, · · · , x_m] is a matrix with m columns x_1 to x_m. The entry of a matrix X at position (i, j) is denoted by x_{i,j}, and the kth element of a vector x is denoted by x_k. The submatrix (subvector) formed from the ath to bth rows and mth to nth columns of X is denoted by X_{a:b,m:n}. The notations (·)^∗, (·)^T, (·)^H, and (·)^† are used for the conjugate, transpose, Hermitian transpose, and Moore-Penrose pseudo-inverse of a matrix, respectively. ‖x‖ is the Euclidean norm of the vector x. ℜ(·) and ℑ(·) are the real and imaginary parts of a complex number, respectively. ⌈x⌋ indicates the closest integer to x; if x is a complex number, then ⌈x⌋ = ⌈ℜ(x)⌋ + i⌈ℑ(x)⌋. I_m and 0_m are the m×m identity and null matrices, respectively.

II. LATTICE-REDUCTION-AIDED DETECTION

A. System Model

We consider a MIMO system with m transmit and n receive antennas in a rich-scattering flat-fading channel. Spatial multiplexing is employed, so that data are transmitted as m substreams of equal rate. These substreams are mapped onto M-ary QAM symbols. Let x denote the complex-valued m×1 transmitted signal vector, and y the complex-valued n×1 received signal vector. The baseband model for this MIMO system is

y = Hx + n, (1)

where H is the n×m channel matrix: its entries are uncorrelated, zero-mean, unit-variance complex circularly symmetric Gaussian fading gains h_{ij}, and n is the n×1 additive white complex Gaussian noise vector with zero mean and E[nn^H] = σ²I. The average power of each transmitted signal x_i is assumed to be normalized to 1, i.e., E[xx^H] = I. Additionally, we assume that the channel matrix entries are fixed during each frame interval, and the receiver has perfect knowledge of the realization of H.
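As a concrete illustration of the model in (1), the following Python snippet (sizes and noise level are example values of ours) draws a channel, symbol, and noise realization matching the stated statistics:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 4          # transmit / receive antennas (example values)
sigma = 0.1          # noise standard deviation per complex entry (assumed)

# Uncorrelated, zero-mean, unit-variance circularly symmetric gains h_ij
H = (rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))) / np.sqrt(2)

# 4-QAM symbols scaled so that E[x x^H] = I (unit power per substream)
x = (rng.choice([-1.0, 1.0], m) + 1j * rng.choice([-1.0, 1.0], m)) / np.sqrt(2)

# Additive noise with E[w w^H] = sigma^2 I ('w' avoids clashing with n above)
w = sigma * (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)

y = H @ x + w        # baseband model (1)
```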

B. Linear Detection

In linear detection, the estimated signal x̂ is computed by first premultiplying the received signal y by an m×n "weight matrix" W. The two most common design criteria for W are zero-forcing (ZF) and minimum mean-square error (MMSE). In zero-forcing detection, the weight matrix W_ZF is set to be the Moore-Penrose pseudo-inverse H^† of the channel matrix H, i.e.,

x̂_ZF = W_ZF y = H^† y = x + H^† n. (2)

It is known that zero-forcing detection suffers from the noise-enhancement problem, as the channel matrix may be ill-conditioned. Under the MMSE criterion, the weight matrix W is chosen in such a way that the mean-squared error between the transmitted signal x and the estimated signal x̂ is minimized. The mean-squared error (MSE) is defined as MSE ≜ E[‖x − x̂‖²] = E[(x − Wy)^H (x − Wy)]. The weight matrix W that minimizes the MSE is

W_MMSE = (H^H H + σ²I)^{−1} H^H. (3)

It is well known that, as σ² → 0, the weight matrix W_MMSE approaches W_ZF. Since W_MMSE takes the noise power into consideration, MMSE detection suffers less from noise enhancement than ZF detection. In [8], [32], it is shown that MMSE is equivalent to ZF in an extended system model, i.e.,

x̂_MMSE = W_MMSE y = H̄^† ȳ = (H̄^H H̄)^{−1} H̄^H ȳ, (4)

where

H̄ = [H; σI_m] and ȳ = [y; 0_{m×1}] (5)

(the extended matrix H̄ stacks σI_m below H, and ȳ appends m zeros below y). Comparing (2) with (4), it follows that the two detection methods can share the same structure in a systolic-array implementation, which we shall elaborate upon in Section IV.
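The equivalence between the direct MMSE filter (3) and ZF on the extended model (4)-(5) is easy to verify numerically. A minimal sketch (random data and arbitrary sizes of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 4, 4, 0.5
H = (rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))) / np.sqrt(2)
y = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# MMSE weight matrix, eq. (3)
W_mmse = np.linalg.inv(H.conj().T @ H + sigma**2 * np.eye(m)) @ H.conj().T
x_mmse = W_mmse @ y

# ZF on the extended system, eqs. (4)-(5)
H_ext = np.vstack([H, sigma * np.eye(m)])   # H-bar: sigma*I stacked below H
y_ext = np.concatenate([y, np.zeros(m)])    # y-bar: m zeros appended below y
x_ext = np.linalg.pinv(H_ext) @ y_ext

assert np.allclose(x_mmse, x_ext)           # the two estimates coincide
```

The assertion holds because H̄ has full column rank (the σI_m block guarantees it), so pinv(H̄) = (H̄^H H̄)^{−1} H̄^H, which reduces algebraically to (3).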

C. Lattice-Reduction-Aided Linear Detection

The idea underlying lattice reduction is the selection of a basis for the lattice under some goodness criterion [33]. We first observe that, under the assumption of QAM transmission, the transmitted vector x is an integer point of a square lattice (after proper scaling and shifting of the original QAM constellation). By interpreting the columns of the channel matrix H as a set of lattice basis vectors, Hx is also a lattice point. If two basis sets H and H̃ are related by H̃ = H·T, with T a unimodular matrix, they generate the same set of lattice points. In MIMO detection, the objective of the lattice reduction algorithm is to derive a better-conditioned channel matrix H̃. In this paper, we focus on the complex-valued LLL algorithm [10], [11]. More details about the LLL algorithm will be provided in Section III.

Fig. 2. Block diagram of linear lattice-reduction-aided detection.

After lattice reduction of the channel matrix, we can perform the linear detection, as described in Section II-B, based on H̃. Consider ZF first. The estimated signal x̃ can be written as

x̃ = H̃^† y = H̃^† ((HT)(T^{−1}x) + n) = T^{−1}x + H̃^† n. (6)

Since x̃ is no longer an integer vector, the simplest but suboptimal way of estimating T^{−1}x is to round x̃ element-wise to the nearest integer. Let x̂_q be the estimate of T^{−1}x after rounding. The final step is to transform x̂_q back into an estimate of x, which is done by multiplying x̂_q by the unimodular matrix T. Since the vector entries after the transformation could lie outside the QAM constellation boundary, we finally quantize those points outside the boundary to the closest constellation point, i.e., x̂_LR = Q(Tx̂_q). Fig. 2 shows the block diagram of LR-aided ZF detection for MIMO. It is easy to see that the same structure can also be used for MMSE detection, by simply replacing H and y with the extended matrix H̄ and the vector ȳ defined in (5), respectively. The remaining operations are the same as in ZF.
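The round-and-map-back steps above fit in a few lines. The sketch below assumes a reduced basis H_red = H·T and its unimodular T have already been produced by some lattice-reduction routine; the names lr_aided_zf, H_red, and constellation are ours.

```python
import numpy as np

def lr_aided_zf(y, H_red, T, constellation):
    """LR-aided ZF per (6): detect in the reduced basis, round, map back.

    H_red is the reduced channel H @ T (T unimodular); `constellation`
    holds the complex constellation points used for final quantization."""
    x_tilde = np.linalg.pinv(H_red) @ y                  # x~ = H~† y  (≈ T^{-1} x)
    x_q = np.round(x_tilde.real) + 1j * np.round(x_tilde.imag)   # element-wise ⌈·⌋
    x_hat = T @ x_q                                      # back to the original basis
    # quantize any entry that fell outside the boundary to the closest point
    return np.array([constellation[np.argmin(np.abs(constellation - s))]
                     for s in x_hat])
```

With a well-conditioned reduced basis, the element-wise rounding in the reduced domain is what gives LRAD its performance advantage over plain linear detection.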

D. LR-Aided Successive Spatial-Interference Cancellation

Besides being suitable for linear detection, the systolic design can also be used to exploit the regularity of successive spatial-interference cancellation (SIC). In [8], it is shown that LR-aided SIC outperforms linear detection methods, while exhibiting a complexity comparable to linear detection. LR-aided SIC can be conveniently described in terms of the QR decomposition of the reduced channel matrix. Here we summarize briefly the procedure of LR-aided ZF-SIC only, as LR-aided MMSE-SIC can be derived in a similar way. Let the QR decomposition of the reduced channel matrix be H̃ = Q̃R̃. First, multiplying y in (1) by Q̃^H, we obtain

v ≜ Q̃^H y = R̃z + Q̃^H n, where z = T^{−1}x. (7)

Then we can solve for z layer by layer, starting from the bottom and moving to the top, i.e.,

ẑ_i = ⌈v_i / r̃_{i,i}⌋, v := v − (R̃_{1:i,i}) ẑ_i, (8)

where i runs from m down to 1 and ẑ_i is the estimate of each entry of z.
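Steps (7)-(8) amount to a rounded back-substitution. A minimal Python sketch (our naming; it assumes the reduced channel H_red = H·T is given, and the final symbol estimate is T @ z_hat followed by constellation quantization):

```python
import numpy as np

def lr_aided_zf_sic(y, H_red):
    """LR-aided ZF-SIC per (7)-(8) on the reduced channel H_red = H @ T.

    Returns z_hat, the layer-by-layer estimate of z = T^{-1} x."""
    Q, R = np.linalg.qr(H_red)           # H~ = Q R
    v = Q.conj().T @ y                   # v = Q^H y = R z + Q^H n
    m = R.shape[1]
    z_hat = np.zeros(m, dtype=complex)
    for i in range(m - 1, -1, -1):       # from the bottom layer to the top
        s = v[i] / R[i, i]
        z_hat[i] = np.round(s.real) + 1j * np.round(s.imag)   # ⌈·⌋ per (8)
        v = v - R[:, i] * z_hat[i]       # cancel layer i: v := v - R_{1:i,i} z_i
    return z_hat
```

Subtracting the full column R[:, i] is equivalent to subtracting only R_{1:i,i}, since R is upper triangular and its entries below row i are zero.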

III. TWO VARIANTS OF LLL ALGORITHM

In this section, we introduce two variants of the LLL algorithm which are more time-efficient than the classical LLL algorithm when using parallel processing. Since systolic arrays yield a simple form of parallel processing, our systolic array design for LRAD is based on these two algorithms.

We begin the discussion with the definition of an LLL-reduced lattice. Let H (an n×m matrix) be a set of lattice basis vectors, with QR decomposition H = QR. The basis set H is complex LLL-reduced with parameter δ (1/2 < δ < 1) if the following two conditions are satisfied [10], [11]:

(a) μ_{i,j} ≜ r_{i,j}/r_{i,i}, |ℜ(μ_{i,j})| ≤ 1/2 and |ℑ(μ_{i,j})| ≤ 1/2, 1 ≤ i < j ≤ m, (9)

(b) δ − |r_{i−1,i}/r_{i−1,i−1}|² ≤ |r_{i,i}|² / |r_{i−1,i−1}|², 2 ≤ i ≤ m. (10)
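Conditions (9) and (10) can be checked directly on the R factor. A small helper of our own naming (it assumes R is square and upper triangular):

```python
import numpy as np

def is_complex_lll_reduced(R, delta=0.75):
    """Test conditions (9) and (10) on an upper-triangular matrix R."""
    m = R.shape[1]
    for j in range(m):
        for i in range(j):
            mu = R[i, j] / R[i, i]
            if abs(mu.real) > 0.5 or abs(mu.imag) > 0.5:
                return False     # size-reduction condition (9) violated
    for i in range(1, m):
        if delta - abs(R[i-1, i] / R[i-1, i-1])**2 > abs(R[i, i])**2 / abs(R[i-1, i-1])**2:
            return False         # Lovász condition (10) violated
    return True
```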

The second condition in (10) is called the Lovász condition, and the process to make the basis set satisfy (9) is called size reduction. In the standard form of the LLL algorithm considered in the literature [8]–[15], size reduction applies only to one column of H during a single iteration. Now, systolic arrays, allowing simple parallel processing, are capable of updating the whole matrix without introducing extra delays. Hence, our proposed systolic array is first designed based on the LLL algorithm in a different form, which we call the "LLL algorithm with full size reduction (FSR-LLL)."

A. LLL algorithm with Full Size Reduction (FSR-LLL)

Table I shows the LLL algorithm with full size reduction. In the following discussion, we refer to the lines in Table I. There are three main differences between FSR-LLL and the conventional complex LLL algorithm¹, although the lattice-reduced bases from both algorithms are still the same. First, the full size reduction (lines 4~10) is executed in each iteration of the while loop (line 3), which means that all columns of R and T are size-reduced at the beginning of each iteration. The advantage here is that, once condition (10) is also fulfilled after full size reduction (i.e., no k′ is found in line 11), then FSR-LLL can immediately end the process (line 20). For example, suppose that k equals 3 at the current iteration. Since all columns in R and T are size-reduced after full size reduction, if no k′ can be found in line 11 (a search that a systolic array can make in parallel), then no further processing is needed in FSR-LLL. However, in the conventional LLL format, the process will not end until columns 3 to m have been sequentially size-reduced. With a systolic-array implementation, FSR-LLL is faster, and its efficiency is especially apparent when m is large. The second difference is that the Givens rotation (lines 13~16) is executed before the column swap (line 17). This is because the Givens rotation process can work in parallel with full size reduction, whereas the column swap cannot. This point will be made clear in Section IV-A. Third, the QR decomposition Q^H H = R is considered as the input of the algorithm, instead

¹For comparison, the interested reader can refer to Table I in [11] for the conventional complex LLL algorithm. Tables I and II in this paper are presented in a format similar to the one in [11]. All the simulation results related to the conventional LLL in this paper are also based on that table.

TABLE I
LLL ALGORITHM WITH FULL SIZE REDUCTION

INPUT: Q^H, R (from the QR decomposition Q^H H = R)
OUTPUT: Q̃^H, R̃, T
(1)  Initialization: T := I_m
(2)  k := 2
(3)  While k ≤ m
(4)    for j = m, ..., 2                                   (Full Size Reduction)
(5)      for i = j−1, ..., 1
(6)        μ := ⌈r_{i,j}/r_{i,i}⌋
(7)        R_{1:i,j} := R_{1:i,j} − μ·R_{1:i,i}
(8)        T_{1:m,j} := T_{1:m,j} − μ·T_{1:m,i}
(9)      end
(10)   end
(11)   Find the smallest k′ between k and m such that condition (10) is violated, i.e., δ·|r_{k′−1,k′−1}|² > |r_{k′,k′}|² + |r_{k′−1,k′}|²
(12)   If k′ exists
(13)     α := r_{k′−1,k′}                                  (Givens Rotation)
(14)     β := r_{k′,k′}
(15)     G := [α^∗ β^∗; −β α] / √(|α|² + |β|²)
(16)     R_{k′−1:k′, k′−1:m} := G·R_{k′−1:k′, k′−1:m},  Q^H_{k′−1:k′, 1:n} := G·Q^H_{k′−1:k′, 1:n}
(17)     Swap columns k′−1 and k′ in R and T               (Column Swap)
(18)     k := max{k′−1, 2}
(19)   else
(20)     k := m + 1
(21)   end
(22) end

TABLE II
ALL-SWAP LATTICE REDUCTION ALGORITHM

INPUT:  Q^H, R
OUTPUT: Q^H, R, T

(1)  T = I                                                       (Initialization)
(2)  order = EVEN
(3)  While (any swap is possible in lines (9) or (16))
(4)    Execute lines 4~10 in Table I                             (Full Size Reduction)
(5)    If order = EVEN
(6)      If δ|r_{k−1,k−1}|² ≤ |r_{k−1,k}|² + |r_{k,k}|² for all even k
(7)        go to line (13)
(8)      else
(9)        Execute lines 13~17 in Table I for all even k between 2 ~ m
             such that δ|r_{k−1,k−1}|² > |r_{k−1,k}|² + |r_{k,k}|²   (Givens Rotation and Column Swap)
(10)       order = ODD
(11)     end
(12)   else
(13)     If δ|r_{k−1,k−1}|² ≤ |r_{k−1,k}|² + |r_{k,k}|² for all odd k
(14)       go to line (6)
(15)     else
(16)       Execute lines 13~17 in Table I for all odd k between 2 ~ m
             such that δ|r_{k−1,k−1}|² > |r_{k−1,k}|² + |r_{k,k}|²
(17)       order = EVEN
(18)     end
(19)   end
(20) end

TABLE III
FPGA IMPLEMENTATION RESULTS

Algorithm                 |         ASLR          |        FSR-LLL        |      CLLL [11]
Target device             | Virtex 5  | Virtex 6  | Virtex 5  | Virtex 6  | Virtex 4  | Virtex 5
Slices                    | 2322/20480| 1812/20000| 2335/20480| 1798/20000| 3617/67584| 1712/17280
Clock frequency           | 160 MHz   | 249 MHz   | 155 MHz   | 247 MHz   | 140 MHz   | 163 MHz
Avg. cycles (time) per channel matrix:
  with SQRD               | 80 (500.0 ns) | 80 (321.3 ns) | 84 (541.9 ns) | 84 (340.1 ns) | 130 (928.6 ns) | 130 (797.5 ns)
  with QRD                | 146 (912.5 ns)| 146 (586.3 ns)| 164 (1058.1 ns)| 164 (664.0 ns)|      --       |      --

Virtex 5 part number: XC5VFX130T; Virtex 6 part number: XC6VLX130T.

of H = QR. From line 16, the Givens rotation matrix G applies to the same two rows of Q^H and R, which simplifies the design of the systolic array. Additionally, after FSR-LLL, Q^H is ready for calculating the pseudoinverse of H for linear detection.
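To make the control flow of Table I concrete, the following is a minimal NumPy sketch of FSR-LLL. This is our own illustrative model, not the paper's hardware design: it executes the steps sequentially, whereas the systolic array overlaps them in time, and all function names are ours.

```python
# Sequential software sketch of FSR-LLL (Table I); line numbers refer to Table I.
import numpy as np

def cround(z):
    """Round real and imaginary parts to the nearest integer (complex rounding)."""
    return np.round(z.real) + 1j * np.round(z.imag)

def fsr_lll(QH, R, delta=0.99):
    """Reduce R (upper triangular, Q^H H = R); returns updated Q^H, R and unimodular T."""
    m = R.shape[1]
    T = np.eye(m, dtype=complex)
    k = 2
    while k <= m:
        # Lines 4-10: full size reduction of every column of R (and T).
        for j in range(m - 1, 0, -1):          # 0-based columns m..2
            for i in range(j - 1, -1, -1):     # rows j-1..1
                mu = cround(R[i, j] / R[i, i])
                if mu != 0:
                    R[:i + 1, j] -= mu * R[:i + 1, i]
                    T[:, j] -= mu * T[:, i]
        # Line 11: smallest k' in k..m violating the Lovasz condition (10).
        kp = next((p for p in range(k, m + 1)
                   if delta * abs(R[p - 2, p - 2]) ** 2
                   > abs(R[p - 2, p - 1]) ** 2 + abs(R[p - 1, p - 1]) ** 2),
                  None)
        if kp is None:
            k = m + 1                          # line 20: terminate immediately
            continue
        # Lines 13-16: Givens rotation on rows k'-1, k' that zeros r_{k',k'}
        # against r_{k'-1,k'}, so the subsequent swap keeps R upper triangular.
        a, b = R[kp - 2, kp - 1], R[kp - 1, kp - 1]
        G = np.array([[np.conj(a), np.conj(b)],
                      [-b, a]]) / np.hypot(abs(a), abs(b))
        rows = [kp - 2, kp - 1]
        R[rows, kp - 2:] = G @ R[rows, kp - 2:]
        QH[rows, :] = G @ QH[rows, :]
        # Line 17: swap columns k'-1 and k' of R and T; line 18: step back.
        R[:, [kp - 2, kp - 1]] = R[:, [kp - 1, kp - 2]]
        T[:, [kp - 2, kp - 1]] = T[:, [kp - 1, kp - 2]]
        k = max(kp - 1, 2)
    return QH, R, T
```

Throughout, the invariant Q^H H T = R is maintained: size reduction updates R and T together, the rotation premultiplies both R and Q^H by G, and the swap permutes the same columns of R and T.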

B. All-Swap Lattice Reduction (ASLR) Algorithm

The ASLR algorithm is a variant of the LLL algorithm, and was first proposed for real-valued lattices only [30]. Table II describes its extension to a complex version. One significant difference between FSR-LLL and ASLR is that every pair of columns k and k − 1 with even (or odd) index k can be swapped simultaneously. The algorithm begins with full size reduction, which is the same as in FSR-LLL. Givens-rotation and column-swap operations (same as in Table I, lines 13~17) are executed on all even (odd) k that violate the condition in (10), and then another iteration starts with the indicator variable "order" set to odd (even). If condition (10) holds for all even (odd) k, Givens rotation and column swap are not executed. Meanwhile, we can immediately check all odd (even) k instead. Matrix R is already full-size reduced, with no need to start the next iteration with full size


reduction (Table II, line 7 or 14). If neither an even nor an odd k violates condition (10) after full size reduction, the ASLR process ends.
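Table II's goto-based control flow can be restated as alternating even/odd sweeps. The sketch below is our own restructuring (with our names): it stops after two consecutive sweeps without a swap, which is equivalent to the while-condition in line (3), and applies the per-pair rotations sequentially; since the row/column pairs of one sweep are disjoint, the order does not matter, which is exactly what makes the parallel hardware version possible.

```python
# Control-flow sketch of ASLR (Table II) for complex lattices.
import numpy as np

def cround(z):
    return np.round(z.real) + 1j * np.round(z.imag)

def full_size_reduce(R, T):
    """Table I, lines 4-10: size-reduce every column of R, tracking T."""
    m = R.shape[1]
    for j in range(m - 1, 0, -1):
        for i in range(j - 1, -1, -1):
            mu = cround(R[i, j] / R[i, i])
            if mu != 0:
                R[:i + 1, j] -= mu * R[:i + 1, i]
                T[:, j] -= mu * T[:, i]

def rotate_and_swap(QH, R, T, k):
    """Table I, lines 13-17, for one pair (k-1, k); k is 1-based."""
    a, b = R[k - 2, k - 1], R[k - 1, k - 1]
    G = np.array([[np.conj(a), np.conj(b)], [-b, a]]) / np.hypot(abs(a), abs(b))
    rows = [k - 2, k - 1]
    R[rows, k - 2:] = G @ R[rows, k - 2:]
    QH[rows, :] = G @ QH[rows, :]
    R[:, [k - 2, k - 1]] = R[:, [k - 1, k - 2]]
    T[:, [k - 2, k - 1]] = T[:, [k - 1, k - 2]]

def aslr(QH, R, delta=0.99):
    m = R.shape[1]
    T = np.eye(m, dtype=complex)
    violates = lambda k: (delta * abs(R[k - 2, k - 2]) ** 2
                          > abs(R[k - 2, k - 1]) ** 2 + abs(R[k - 1, k - 1]) ** 2)
    parity, clean = 0, 0                 # parity 0: even k; parity 1: odd k
    full_size_reduce(R, T)
    while clean < 2:                     # stop once both parities pass cleanly
        bad = [k for k in range(2, m + 1) if k % 2 == parity and violates(k)]
        if bad:
            for k in bad:                # disjoint pairs: "parallel" in hardware
                rotate_and_swap(QH, R, T, k)
            full_size_reduce(R, T)       # re-reduce only after swaps occurred
            clean = 0
        else:
            clean += 1                   # no swap: check the other parity next
        parity ^= 1
    return QH, R, T
```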

C. Replacing the Lovász condition with the Siegel condition

From the previous discussion, it is clear that all basis vectors are size-reduced within one processing iteration of full size reduction. Additionally, according to line 11 in Table I and lines 6 and 13 in Table II, the lattices processed by FSR-LLL and ASLR both satisfy the Lovász condition in (10). Therefore, we can conclude that these two algorithms also generate LLL-reduced lattices. Consequently, like the conventional LLL, FSR-LLL-aided and ASLR-aided detection also achieve full receive diversity in MIMO systems [11], [12].

The Lovász condition involves two diagonal elements and one off-diagonal element of the matrix R. In order to simplify the data communication between processing elements in the systolic array, we relax the Lovász condition by replacing it with

δ − 1/2 ≤ |r_{i,i}|² / |r_{i−1,i−1}|²,   2 ≤ i ≤ m,   (11)

where δ lies in the range (1/2, 1), the same as for the Lovász condition. Condition (11) is also called the Siegel condition [31],

and it is weaker than the Lovász condition because

δ − 1/2 ≤ δ − |r_{i−1,i} / r_{i−1,i−1}|² ≤ |r_{i,i}|² / |r_{i−1,i−1}|²,   2 ≤ i ≤ m.   (12)

The first inequality follows from (9). A similar approximation to (11) can be found in [34]. The advantage of using this new condition is that only two neighboring diagonal elements of R are involved. We will discuss the impact of this new condition on the systolic array design further in Section IV. Another advantage comes from the fact that the new condition check can be done by taking the square root in (11). In hardware implementation, this implies that we can save precision bits by storing |r_{i,i}|/|r_{i−1,i−1}| rather than |r_{i,i}|²/|r_{i−1,i−1}|². Additionally, the condition check can be done without a division, simply by comparing the values of |r_{i,i}| and √(δ − 1/2) |r_{i−1,i−1}|, where √(δ − 1/2) is a pre-computed constant once δ is determined. In the balance of this paper, when we refer to FSR-LLL and ASLR we mean FSR-LLL and ASLR with the Siegel condition.

Since the Siegel condition is weaker than the Lovász condition, one might expect the performance of the lattice reduction algorithm with condition (11) to be worse. Yet, by a proof similar to that in [11], [12] we can show that the LLL algorithm with the Siegel condition also achieves maximum receive diversity in MIMO systems. In the proof of LLL-aided detection achieving full diversity [11], [12], the key step, and the only step involving the LLL-reduced conditions, is that the orthogonality defect κ (κ ≥ 1) of the LLL-reduced basis set H is upper bounded by

κ ≜ ∏_{i=1}^{m} ‖h_i‖² / det(H^H H) ≤ 2^{−m} (2/(2δ − 1))^{m(m+1)/2},   (13)

where the h_i's are the columns of H. In particular, (13) also holds for the lattices reduced by the LLL algorithm with the Siegel condition. This can be justified by the same proof as in [11, Appendix B], whose details are omitted in this paper. Hence, the LLL algorithm with the Lovász condition replaced by the Siegel condition also achieves maximum diversity in MIMO systems. However, achieving maximum receive diversity does not automatically imply that the bit-error-rate (BER) performance is as good as with the conventional LLL algorithm. One can easily observe that if δ is very close to 1/2, condition (11) is almost always true. Thus, the Givens rotation and column swap steps in the reduction algorithm would seldom be performed, which causes the BER performance to be much worse than with the conventional LLL. On the contrary, as δ approaches 1 one can expect the performance of FSR-LLL and ASLR to be closer to the conventional LLL. In Fig. 3, we show the empirical cumulative probability functions of the orthogonality defect κ for 4 × 4 channel matrices under three different reduction algorithms. The results of FSR-LLL and ASLR overlap for all three values of δ, which implies that the effects of these two methods on lattice reduction are almost the same. For δ = 0.99, FSR-LLL and ASLR give a result close to the LLL with δ = 0.75, which is a very common setting as documented in previous works [8], [9], [12]. For δ = 0.51 and 0.75, the gap between LLL and FSR-LLL (ASLR) is much


Fig. 3. The empirical cumulative probability functions of the orthogonality defect κ for 4 × 4 channel matrices under three different reduction algorithms (δ = 0.51, 0.75, 0.99; "no reduction" shown for reference).

larger than for δ = 0.99. In Section IV-C, we will show that for δ equal to 0.99, the BER performance of LR-aided linear detection using FSR-LLL and ASLR is not worse than that using the conventional LLL with the same δ value. Based on these results, in our systolic array design we choose δ = 0.99.
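For reference, the orthogonality defect κ plotted in Fig. 3 can be computed directly from its definition in (13) (a NumPy sketch; the function name is ours):

```python
import numpy as np

def orthogonality_defect(H):
    """kappa = prod_i ||h_i||^2 / det(H^H H); equals 1 iff the columns of H
    are mutually orthogonal, and grows as the basis becomes more correlated."""
    col_norms_sq = np.sum(np.abs(H) ** 2, axis=0)   # ||h_i||^2 per column
    gram = H.conj().T @ H                           # H^H H
    return float(np.prod(col_norms_sq) / np.linalg.det(gram).real)
```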

IV. SYSTOLIC ARRAY FOR TWO LATTICE-REDUCTION ALGORITHMS

From Fig. 2, the whole process of LRAD can be viewed as taking two steps: lattice reduction for the channel matrix, and detection. In this section, we exhibit our systolic array design for the LLL lattice reduction algorithm. The ensuing linear detection or SIC on the systolic array will be discussed in Section V. In the following discussion, we assume that the channel matrix has been QR-decomposed. It is known that QRD can be implemented in a systolic array based on a series of Givens rotations, since Givens rotations can be executed in a parallel manner [20]–[22]. Since the conventional systolic array for QRD usually contains square-root operations, which are computationally intensive in hardware implementation, a square-root-free systolic QRD based on Squared Givens Rotations (SGR) can be used (the interested reader can refer to [29], [35]). In [8], it is also shown that the sorted QRD (SQRD) can reduce the number of column swaps in the LLL algorithm, and hence leads to less processing time. However, SQRD also requires higher hardware complexity and latency than the conventional QRD [36].

A. Systolic Array for FSR-LLL

In the following, we assume a 4 × 4 MIMO system (i.e., m = 4, n = 4) and illustrate the proposed systolic algorithm in three parts: full size reduction, Givens rotation, and column swap.

1) Full Size Reduction: The systolic array for the remaining parts of LRAD is shown in Fig. 4(a). Four different kinds of PEs are used, viz., diagonal cells, off-diagonal cells, vectoring cells, and rotation cells. For the full size reduction part, only diagonal and off-diagonal cells are needed; the operations of these two types of PEs are shown in detail in Fig. 4(b). The vectoring cell and rotation cell will be introduced with the Givens rotation description. There is a slight difference between the off-diagonal cells in the upper-triangle part and those in the lower-triangle part. Fig. 4(b) shows only the off-diagonal cell in the upper-triangle part. The off-diagonal cells in the lower-triangle part have y_in and c_in come from the top, while c_out leaves from the bottom. Except for this minor difference in the data interface, the operations are the same as for the off-diagonal cells in the upper-triangle part. Additionally, in Fig. 4(b) the dotted lines represent the logic control signals transmitted between cells, and the solid lines represent the data transmitted. To initialize the process, each element of the matrices R and Q^H (denoted as r and q, respectively, in Fig. 4(b)) from the QR decomposition is stored in the PE at the corresponding position. For example, q_{i,i} and r_{i,i} are stored in the corresponding diagonal cell D_ii. The off-diagonal elements q_{i,j} and r_{i,j} are stored in the off-diagonal cell O_ij. Additionally, the elements of the unimodular matrix T (denoted as t in Fig. 4(b)) are also stored in the array, with T initially set to the identity matrix.

Fig. 5 shows the overall process of the full size reduction in the systolic array. In this stage, two major processing modes are defined in each diagonal and off-diagonal cell, the size reduction mode and the data mode, as detailed in Fig. 4(b). In the size reduction mode, the objective of each cell is to make condition (9) valid. On the other hand, the cell only performs data propagation in the data mode. The cell decides to work in either mode depending on the occurrence of the logic control signal "#". For simplicity, we assume the cells execute all operations in the data mode or the size-reduction mode in one normalized cycle². At T = 0, the external controller sends in the logic control signal "#" to cell D33 through cell D44. At T = 1, cell D33 works in the data mode due to the control signal "#" and spreads out the "#" logic control signal to the neighboring 3 cells. Meanwhile, D33 sends out the data (r_{3,3}, t_{3,3})^(*) to cell O34. Note that the superscript (*) is a tag bit attached to the data, which indicates that the data are sent out by a diagonal cell. The occurrence of a tag bit (*) will drive the off-diagonal cell to compute μ, and use μ to update the data stored in that cell. As a result, at T = 2, cell O34 sends out the newly computed μ to the two neighboring cells O24 and D44. At the next time instant (T = 3), the μ signal generated by O34 meets the data coming from cell O23 (O43) inside the cell O24 (D44), and executes the size reduction update. At the same time instant, data (r_{2,2}, t_{2,2})^(*) enter cell O23. As cell O34 did at T = 2, cell O23 computes μ, updates (r_{2,3}, t_{2,3}), and sends out μ to the neighboring cells O13 and D33. The most important fact here is that cell O23 also propagates the data (r_{2,2}, t_{2,2})^(*) to cell O24, and thus starts the column operations between column 2 and column 4 at T = 4. Similarly, the column operations between column 1 and column 4 begin at T = 6 as (r_{1,1}, t_{1,1})^(*) enters cell O14. Essentially, full size reduction is a series of column operations between column j and columns j − 1, j − 2, ..., 1, for all 2 ≤ j ≤ m, and we

²The real hardware cycle counts could be multiples of the normalized cycle.


Fig. 1. (a) Two-dimensional systolic array performing the matrix calculation D = AB + C, where a_{ij}, b_{ij}, c_{ij}, d_{ij} are the (i,j) entries of the matrices A, B, C, and D. (b) The operation of each processing element.

Fig. 2. Block diagram of linear lattice-reduction-aided detection.

Fig. 4. (a) The systolic array for the linear LRAD of a 4 × 4 MIMO system. (b) The operations of diagonal and off-diagonal cells in the systolic array. ("*" is an indicator bit used to control the flow of the algorithm, as explained in Section IV-A.)

can conclude the following facts for an m × m MIMO system:

[Fact 1] In this systolic flow, the column operation between column j and column i (i < j) begins at T = m + j − 2i as (r_{i,i}, t_{i,i})^(*) enters cell O_ij.
Proof: Data (r_{i,i}, t_{i,i})^(*) leave cell D_ii at T = m − i, and it takes j − i cycles for (r_{i,i}, t_{i,i})^(*) to propagate from cell D_ii to cell O_ij.

[Fact 2] All column operations on column j end at T = 2m + j − 3 in cell O_mj.
Proof: In this systolic flow, the last column operation on column j is always between column j and column 1, which starts at T = m + j − 2 in cell O_1j according to Fact 1. It takes m − 1 more cycles to propagate μ from cell O_1j to cell O_mj and finish the column operation.

Fig. 5. Flow chart of the full size reduction operations in the systolic array.

[Fact 3] The full size reduction ends at T = 3m − 3, when all updates on column m are done.
Proof: The full size reduction ends when column m finishes all its column operations. Therefore, it follows from Fact 2 that the last step is at T = 3m − 3.

Referring back to the example mentioned in Section III-A, we can get a more concrete view of the advantage of FSR-LLL over the conventional LLL form when a systolic array is used. If FSR-LLL is applied, the systolic array takes a total of 3m − 3 cycles to end all the processes. However, for non-systolic LLL, it takes 2m + j − 3 cycles to process column j, and the column operations cannot be done in parallel. So the total time to perform size reduction in non-systolic LLL would be ∑_{j=3}^{m} (2m + j − 3) = 2.5m² − 6.5m + 3 cycles in that example. In this case, as m increases beyond 3, the advantage of FSR-LLL over the conventional format becomes significant.

2) Givens Rotation: As mentioned in Section III-C, we use the Siegel condition in the lattice reduction algorithm, which only relates two r elements in neighboring diagonal cells. Hence, this condition can be checked during a full size reduction step. For example, in Fig. 5 at T = 1,


Fig. 6. The operations of vectoring cells and rotation cells in the systolic array.

cell D33 sends data r_{3,3} to cell D22 along with the "#" signal. At the next time instant, cell D22 will check this condition based on |r_{3,3}|²/|r_{2,2}|², and also generate the logic control signal "swap" (see Fig. 4(b)). If δ − 1/2 is greater than |r_{i,i}|²/|r_{i−1,i−1}|², then "swap" is "true", and drives the vectoring cell to work. The operations of vectoring and rotation cells are shown in Fig. 6. The vectoring cell zeros out the input data β by the Givens rotation matrix G, which is calculated based on Table I, lines 13 to 15. The rotation cell simply rotates the input data by the angle Θ given by the neighboring vectoring cell. Hence, the vectoring and rotation cells also work in a systolic way, with the rotation angle Θ propagating between cells. As shown in Fig. 4(a), there are 3 rotation cells and 1 vectoring cell between every two consecutive rows of the systolic array. These cells perform the Givens rotation on the R and Q^H data in those two rows. The vectoring cell is located between cells D_ii and O_{i−1,i} because the Givens rotation step is executed prior to the column-swap step in FSR-LLL, and the data r_{i,i} need to be zeroed so that the matrix R is still upper triangular after the column swap.

Note that the Givens rotation only applies to rows k′ and k′ − 1 during one iteration of FSR-LLL if k′ exists (lines 13~16 in Table I). However, every D_ii (i = 1, ..., m − 1) could generate the "swap" signal during the full size reduction step. Therefore, we need direct access from the external controller to each diagonal cell in order to control the data path between the diagonal cell and the vectoring cell. Namely, only cell D_{k′k′} can pass the signal "swap" to the vectoring cell and perform the Givens rotation on rows k′ and k′ − 1. In Fig. 4(a), we use a "switch" symbol between each pair of a diagonal cell and a vectoring cell to represent the control by the external controller. Only one switch is turned on during one iteration.

Additionally, a Givens rotation on rows k′ and k′ − 1 can begin right after r_{k′−1,k′} is updated during the full size reduction step. For example, r_{3,4} is updated at T = 2 as shown in Fig. 5, and the Givens rotation on rows 3 and 4 could start as early as T = 3 without any interference with the remaining operations of full size reduction. This way, the time necessary to perform Givens rotations can be partially hidden by the full size reduction, and this is the reason why we want the Givens rotation to occur prior to the column swap in our design. For hardware implementation, one could consider using only one rotation cell between every two neighboring rows of the systolic array to reduce the hardware complexity. This will not lead to a significant increase in time if we consider performing Givens rotation and full size reduction in parallel.

3) Column Swap: The columns k′ and k′ − 1 of R (and T) should be swapped after the Givens rotation is done. However, it is possible for the column swap to partially overlap in time with size reduction and Givens rotation. For example, the column swap could begin after R has been rotated but prior to Q^H being updated, since there is no need to swap columns of Q^H.

The FSR-LLL stops when there is no possible column swap, i.e., a k′ in Table I, line 11, does not exist. The system flow (lines 3, 18, and 20 in Table I) is controlled by the external processor. The lattice-reduced matrices R and Q^H and the unimodular matrix T stay in the PEs. The systolic array along with these matrices will be used for linear detection, as described in Section V below.

B. All-Swap Lattice Reduction (ASLR) Algorithm

The ASLR algorithm can also be performed by the systolic array shown in Fig. 4(a). The process of full size reduction is the same as in Fig. 5. During full size reduction, the Siegel condition is also checked in each diagonal cell D11~D_{m−1,m−1}. If the current value of "order" is even (odd), then the "switch" between each cell D_{k−1,k−1} with even (odd) index k and the vectoring cell is turned on by the external controller. Consequently, for every even (odd) index k, the Givens rotation between rows k − 1 and k can be executed if needed. As for the column swap step, more than one pair of columns could be swapped during one iteration, but all these pairs are swapped in parallel. Hence, the time spent on column swaps is the same as on swapping a single pair of columns. Based on this observation, we can expect the systolic ASLR to work more efficiently than the systolic FSR-LLL. Comparisons between these two algorithms in terms of bit-error-rate performance and of efficiency in execution time are deferred to the next subsection.

Note that in our description we limit the application of this systolic array to an m × m MIMO system. For m × m MMSE-LRAD, although the matrix Q^H is m × 2m (the extended channel model in (5)), we can treat the submatrix Q^H_{1:m,(m+1):2m} as another square matrix, and store each element of Q^H_{1:m,(m+1):2m} in the PE at the corresponding position. Namely, q_{i,j} and q_{i,j+m} should be stored in the same PE, which keeps the systolic array square.

C. Comparison between the FSR-LLL and ASLR algorithms

First, we compare the two algorithms in terms of bit-error-rate (BER) performance, and also compare them with the conventional LLL algorithm. In our simulation, 4-QAM is assumed for the transmitted symbols. The constant δ is set to 0.99 in all algorithms for fair comparison. Let E_b be defined as the equivalent energy per bit at the receiver, and thus E_b/N_0 is m/(σ² log₂ M). Fig. 7(a) shows the BER results of minimum mean-square-error LRAD (in 4 × 4 and 8 × 8 MIMO systems) based on FSR-LLL (denoted as MMSE-FSR), the ASLR algorithm (denoted as MMSE-ASLR), and the LLL algorithm (denoted as MMSE-LLL). The BER results for ML detection and MMSE without lattice reduction are also shown for comparison. With δ = 0.99, FSR-LLL and ASLR work as


Fig. 7. BER performance of FSR-LLL- and ASLR-based MMSE LRAD. (a) Linear detection (4 × 4 and 8 × 8 MIMO systems). (b) SIC (a 4 × 4 MIMO system).

well as the LLL algorithm, and even slightly better in the case of m = 8. This clearly shows that using the insignificantly weaker Siegel condition does not deteriorate the BER performance of linear detection in a MIMO system as compared with the conventional LLL. In Fig. 7(b), the BER performance of a 4 × 4 MIMO system using LR-aided MMSE SIC based on different lattice reduction algorithms is shown. Unlike the linear detection case, the LLL-aided SIC works better than the other two algorithms. Since the detection of the first layer in SIC dominates the overall performance, this implies that, due to the Siegel condition, the FSR-LLL-reduced or the ASLR-reduced channel provides a lower SNR for the first layer in SIC than the one given by the conventional LLL. Additionally, FSR-LLL and ASLR lead to almost the same results in all three MIMO systems, which is consistent with the results in Fig. 3. Hence, we can conclude that although FSR-LLL and ASLR give different lattice-reduced matrices, LRAD based on these two algorithms has very similar BER performance.

Next, we compare the efficiency of the systolic array for both algorithms. It is known that the number of iterations of FSR-LLL and ASLR depends on the condition number of the channel matrix. If H is well-conditioned, lattice reduction


Fig. 8. The average number of column swaps in FSR-LLL-, ASLR-, and LLL-aided MMSE detection in m × m MIMO systems with E_b/N_0 fixed at 20 dB.


Fig. 9. The average number of floating point operations in FSR-LLL-, ASLR-, and LLL-aided MMSE detection in m × m MIMO systems with E_b/N_0 fixed at 20 dB.

takes fewer iterations, and thus fewer cycles in the systolic array. Since both algorithms begin with full size reduction, the total execution time is fully determined by the number of column swaps in the overall process: less column swapping implies fewer iterations. Fig. 8 shows the average number of column swaps in FSR-LLL- and ASLR-aided MMSE detection (with Eb/N0 fixed at 20 dB) in m × m MIMO systems (m = 4~16). Note that for ASLR we count all the even or odd column swaps during one iteration as only one swap, since they are executed in parallel. In a 4 × 4 MIMO system, the difference between the two algorithms is almost negligible. However, as the number of antennas grows, the advantage of ASLR becomes significant. For m ≥ 8, ASLR requires less than 65% of the column swaps of FSR-LLL. Based on BER-performance and time-efficiency comparisons, ASLR should be the better algorithm to apply on our systolic array, especially with a large number of antennas.

For comparison, the results of the conventional LLL with δ = 0.99 and 0.75 are also shown in Fig. 8. As expected,


[Fig. 10 plot: BER vs. Eb/N0 for ASLR & FSR-LLL (floating-point), ASLR (fixed-point), and FSR-LLL (fixed-point).]

Fig. 10. Comparison between the fixed-point and floating-point lattice reduction algorithms using ZF-SIC in a 4 × 4 MIMO system.

LLL with δ = 0.99 has a higher complexity than LLL with δ = 0.75. Furthermore, the conventional LLL has a much higher average number of column swaps than FSR-LLL and ASLR in higher-dimensional MIMO systems (m ≥ 8). However, it is not fair to conclude that the complexities of FSR-LLL and ASLR are much lower than that of the conventional LLL; in fact, full size reductions are performed in the former two algorithms, and full size reduction needs more computational effort than the conventional size reduction in LLL. In Fig. 9, we compare the number of floating point operations (flops) in LLL, FSR-LLL, and ASLR using the same settings as in Fig. 8. The flops are counted in terms of the number of real additions and real multiplications: one complex addition is counted as two flops (two real additions) and one complex multiplication is counted as six flops (four real multiplications and two real additions). The complexity of the QR decomposition is neglected, since it is done only once at the beginning of all three algorithms. It is shown that LLL with δ = 0.99 has the highest complexity among the three. Under the same δ (= 0.99) setting, FSR-LLL and ASLR have a much lower computational complexity than LLL. On the other hand, the complexity of LLL with δ = 0.75 is just slightly higher than that of FSR-LLL and ASLR, even though the average number of column swaps of LLL with δ = 0.75 is more than two times larger than that of ASLR for m ≥ 10. This implies that the process of full size reduction introduces some additional complexity. However, thanks to the (insignificantly) weaker Siegel condition, the complexities of ASLR and FSR-LLL for m ≥ 10 are less than 50% of the complexity of LLL with the same δ setting.
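The flop-counting convention above is easy to make concrete. A minimal sketch in Python (the helper names are ours, introduced only for illustration):

```python
# Flop-counting convention used in Fig. 9: a real addition or a real
# multiplication is one flop, so a complex addition costs 2 flops and a
# complex multiplication (4 real muls + 2 real adds) costs 6 flops.
FLOPS_CADD = 2
FLOPS_CMUL = 6

def flops_complex_dot(n):
    """Flops for an n-term complex inner product: n muls and n-1 adds."""
    return n * FLOPS_CMUL + (n - 1) * FLOPS_CADD

def flops_matvec(m, n):
    """Flops for an m x n complex matrix times an n x 1 complex vector."""
    return m * flops_complex_dot(n)

print(flops_complex_dot(4))  # 4*6 + 3*2 = 30
print(flops_matvec(4, 4))    # 4 * 30 = 120
```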

To further explore the advantage of using a systolic array, we implemented our proposed architecture for a 4 × 4 MIMO system on an FPGA. We carried out our design using the Xilinx System Generator 11.5 (XSG) block-set in the Simulink design environment. Verilog Hardware Description Language (HDL) code is then generated automatically by XSG and synthesized by Xilinx XST. The place and route is done by Xilinx ISE 11.5. The word-lengths of R, QH, T, and µ are set to (18,13), (14,13), (8,0), and (3,0), respectively. As mentioned


TABLE I
LLL ALGORITHM WITH FULL SIZE REDUCTION (FSR-LLL)

INPUT:  QH, R (from the QR decomposition of H)
OUTPUT: updated QH, R, and the unimodular matrix T

(1)  Initialization: T := I
(2)  k := 2
(3)  While k ≤ m
       [Full Size Reduction]
(4)    for j = m, ..., 2
(5)      for i = j−1, ..., 1
(6)        µ := round(r_{i,j} / r_{i,i})
(7)        R(1:i, j) := R(1:i, j) − µ · R(1:i, i)
(8)        T(1:m, j) := T(1:m, j) − µ · T(1:m, i)
(9)      end
(10)   end
(11)   Find the smallest k between 2 ~ m such that δ|r_{k−1,k−1}|² > 2|r_{k,k}|² (the Siegel condition is violated)
(12)   If such k exists
         [Givens Rotation]
(13)–(15)  Compute the 2×2 Givens rotation G from r_{k−1,k} and r_{k,k} that restores the upper-triangular form after swapping columns k−1 and k
(16)       R(k−1:k, k−1:m) := G · R(k−1:k, k−1:m);  QH(k−1:k, 1:m) := G · QH(k−1:k, 1:m)
         [Column Swap]
(17)       Swap columns k−1 and k in R and T
(18)       k := max{k−1, 2}
(19)   else
(20)       k := k+1
(21)   end
(22) end

TABLE II
ALL-SWAP LATTICE REDUCTION (ASLR) ALGORITHM

INPUT:  QH, R (from the QR decomposition of H)
OUTPUT: updated QH, R, and the unimodular matrix T

(1)  Initialization: T := I
(2)  order := EVEN
(3)  While (any swap is possible in lines (9) or (16))
(4)    Execute lines (4)~(10) in Table I [Full Size Reduction]
(5)    If order = EVEN
(6)      If δ|r_{k−1,k−1}|² ≤ 2|r_{k,k}|² for all even k
(7)        go to line (13)
(8)      else
(9)        [Givens Rotation and Column Swap] Execute lines (13)~(17) in Table I for all even k between 2 ~ m such that δ|r_{k−1,k−1}|² > 2|r_{k,k}|²
(10)       order := ODD
(11)     end
(12)   else
(13)     If δ|r_{k−1,k−1}|² ≤ 2|r_{k,k}|² for all odd k
(14)       go to line (6)
(15)     else
(16)       Execute lines (13)~(17) in Table I for all odd k between 2 ~ m such that δ|r_{k−1,k−1}|² > 2|r_{k,k}|²
(17)       order := EVEN
(18)     end
(19)   end
(20) end
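The column-wise size-reduction update at the heart of the full size reduction shared by both algorithms can be sketched in a few lines of Python. This is our own illustrative translation of the standard LLL size-reduction step (the function name is ours), not the authors' fixed-point implementation:

```python
# Size-reduce column j of R against column i (i < j): choose the integer
# mu closest to r[i][j]/r[i][i], subtract mu times column i from column j,
# and record the same elementary operation in the unimodular matrix T.
def size_reduce_column(R, T, i, j):
    mu = round(R[i][j] / R[i][i])          # nearest-integer quotient
    if mu != 0:
        for row in range(i + 1):           # R is upper triangular: rows 0..i
            R[row][j] -= mu * R[row][i]
        for row in range(len(T)):          # T is a full m x m matrix
            T[row][j] -= mu * T[row][i]
    return mu

# Tiny 2x2 example: after the update, |r[0][1]| <= |r[0][0]|/2.
R = [[2.0, 3.0], [0.0, 1.0]]
T = [[1, 0], [0, 1]]
mu = size_reduce_column(R, T, 0, 1)
print(mu, R[0][1], T)   # 2 -1.0 [[1, -2], [0, 1]]
```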

TABLE III
FPGA IMPLEMENTATION RESULTS

Algorithm             ASLR                        FSR-LLL                     CLLL [14]
Target device         Virtex 5(1)   Virtex 6(2)   Virtex 5(1)   Virtex 6(2)   Virtex 4       Virtex 5
Slices                2322/20480    1812/20000    2335/20480    1798/20000    3617/67584     1712/17280
Clock frequency       160 MHz       249 MHz       155 MHz       247 MHz       140 MHz        163 MHz
Avg. cycles (time) per channel matrix:
  with SQRD           80 (500.0 ns) 80 (321.3 ns) 84 (541.9 ns) 84 (340.1 ns) 130 (928.6 ns) 130 (797.5 ns)
  with QRD            146 (912.5 ns) 146 (586.3 ns) 164 (1058.1 ns) 164 (664.0 ns) —         —

(1) part number: XC5VFX130T   (2) part number: XC6VLX130T

in Section III-C, the division in the Siegel condition check can be avoided by using a comparator. The divisions in the Givens rotation are implemented by the Newton-Raphson iterative algorithm [37]. As for µ, it can easily be shown by simulation that |µ| is either 0, 1, or 2 over 99.7% of the time. Hence, we can simply use a set of comparators to determine the value of µ instead of using a division. Values of |µ| greater than 2 are saturated to 2, which rarely happens. The BER performance of the fixed-point systolic implementation for a 4 × 4 MIMO system is shown in Fig. 10, where 16-QAM modulation and ZF-SIC detection are applied. The implementation results are shown in Table III. We consider both QRD and SQRD as the pre-processing of the lattice reduction algorithms. From the results, ASLR is superior to FSR-LLL in terms of the average processing time, and this advantage is significant when QRD is applied. The hardware complexities of ASLR and FSR-LLL are about the same, since they differ from each other only in the external controllers. It is also clear that SQRD reduces the average processing time by over 45% compared with the normal QRD, at the cost of a higher computational effort for the SQRD itself.
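The Newton-Raphson division mentioned above replaces each divide by a few multiply-subtract iterations, which maps well onto FPGA multipliers. A floating-point sketch of the idea follows; the seed constant and iteration count are illustrative choices, not taken from the paper's fixed-point design:

```python
import math

def nr_reciprocal(d, iterations=5):
    """Approximate 1/d with Newton-Raphson: x <- x * (2 - d*x).
    Each iteration roughly doubles the number of correct bits, so the
    hardware needs only multipliers and subtractors, no divider."""
    if d == 0:
        raise ZeroDivisionError("reciprocal of zero")
    sign = -1.0 if d < 0 else 1.0
    m, e = math.frexp(abs(d))                 # abs(d) = m * 2**e, m in [0.5, 1)
    x = 48.0 / 17.0 - (32.0 / 17.0) * m       # linear initial guess on [0.5, 1)
    for _ in range(iterations):
        x = x * (2.0 - m * x)                 # Newton step for f(x) = 1/x - m
    return sign * math.ldexp(x, -e)           # undo the scaling

print(abs(nr_reciprocal(3.0) - 1.0 / 3.0) < 1e-12)   # True
```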

In Table III, the FPGA implementation results for the conventional complex LLL (CLLL) [14] are also listed for comparison. On a Virtex 5 and with SQRD, the systolic ASLR operates at a slightly lower clock speed than CLLL; however, our design requires only 61.5% of their average clock cycles. As a result, ASLR is on average faster than CLLL by a factor of 1.6. This verifies the high-throughput advantage of systolic arrays. On the other hand, a systolic-array implementation may have a higher hardware complexity, since it requires several processing elements to work in parallel. The results in Table III show that our designs occupy 36~38% more FPGA slices than the CLLL design. However, given the rapid advance of FPGA technology and semiconductor processing, one may consider trading some area for a faster processing speed. As shown in Table III, when using the latest Xilinx Virtex 6 FPGA device, our systolic designs can run at up to 249 MHz while requiring less than 10% of the total FPGA slices.
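The 61.5% and 1.6× figures quoted above follow directly from the Table III entries, since the average time per channel matrix is the cycle count divided by the clock frequency. A quick check:

```python
# Average time per channel matrix = cycles / clock frequency.
# Numbers from Table III (Virtex 5, SQRD pre-processing).
aslr_cycles, aslr_mhz = 80, 160.0
clll_cycles, clll_mhz = 130, 163.0

aslr_ns = aslr_cycles / aslr_mhz * 1e3   # 500.0 ns
clll_ns = clll_cycles / clll_mhz * 1e3   # ~797.5 ns

print(round(aslr_cycles / clll_cycles, 3))  # 0.615 -> 61.5% of CLLL's cycles
print(round(clll_ns / aslr_ns, 2))          # 1.6   -> ~1.6x faster on average
```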

V. SYSTOLIC ARRAY FOR DETECTION METHODS

A. Linear Detection in Systolic Array

After lattice reduction, the matrices QH and R, along with the unimodular matrix T, are stored in the systolic array. As


shown in Fig. 2, the first step of a linear detection consists of premultiplying the received signal vector y by H†, which yields x = H†y = R−1QHy. Second, the result of this matrix–vector multiplication needs to be rounded element-wise. The final step is to multiply the rounded result by the unimodular matrix T and constrain all results within the constellation boundary. If xq denotes the element-wise-rounded x, the final decision of the LRAD is xLR = Q(T · xq), as described in Section II-C.
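The three steps can be summarized in a short sketch. This is our own plain-Python illustration of the decision rule xLR = Q(T · xq) on a real-valued toy example, with a clip to the constellation boundary standing in for the quantizer Q:

```python
# Linear LRAD decision for a toy real-valued example:
# 1) x = H^dagger * y is assumed already computed,
# 2) round x element-wise, 3) multiply by the unimodular T,
# 4) constrain the result within the constellation boundary.
def lrad_decide(x, T, levels):
    xq = [round(v) for v in x]                        # element-wise rounding
    z = [sum(T[r][c] * xq[c] for c in range(len(xq)))
         for r in range(len(T))]                      # z = T * xq
    lo, hi = min(levels), max(levels)
    return [min(max(v, lo), hi) for v in z]           # boundary check

T = [[1, -1], [0, 1]]          # a unimodular matrix (det = 1)
x = [2.2, 0.9]                 # equalized signal in the reduced domain
print(lrad_decide(x, T, levels=(-3, -1, 1, 3)))       # [1, 1]
```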

In the following discussion, we assume a 4 × 4 MIMO system, and consider the zero-forcing detection first. The first and last steps of a linear detection can be implemented by the same systolic array of Fig. 4 without using extra cells. As for the rounding and the final constellation-boundary check, they should be done outside the systolic array (they are not shown in Fig. 11). To execute x = R−1QHy in the systolic array, we separate it into two matrix–vector multiplications: v = QHy

and then x = R−1v. Since QH stays in the systolic array after the lattice reduction ends, the received signal vector y can be fed to the systolic array from the top in a skewed manner, as shown in Fig. 11(a). The vector QHy is pumped out from the rightmost column of the array. Diagonal and off-diagonal cells are needed at this stage, and the operations of the cells are shown in Fig. 12(a). Every cell performs a multiply-and-add operation. If MMSE is chosen, the input vector should be changed to a 2m × 1 vector y according to the extended model (5). Let y = [y1^T y2^T]^T and QH = [Q1 Q2], where y1, y2 are m × 1 vectors and Q1, Q2 are m × m matrices. As mentioned in Section IV-B, the elements of Q1 and Q2 are stored in the same PEs. To compute v = QHy using the systolic array, first we let y1 enter the array from the top and multiply it by Q1, which is the same as shown in Fig. 11(a). Then y2 enters the array right after y1, also in a skewed manner, and is multiplied by Q2. Hence, for MMSE we need an extra operation at the output of the array, which is v = Q1y1 + Q2y2. For the remaining operations in the systolic array, there is no difference between ZF and MMSE detections.

The second stage consists of computing x = R−1v. Instead of computing R−1 directly, the following recursive equation [38] is considered for the systolic design

x_j = ( v_j − Σ_{i=j+1}^{m} r_{j,i} x_i ) / r_{j,j},   j = m, m−1, ..., 1.   (14)
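The recursion (14) is ordinary back substitution; a direct sketch of it (our own illustration, not the PE micro-code):

```python
# Back substitution implementing (14): solve R x = v for upper-triangular R
# without ever forming R^{-1}, proceeding from the last row up to the first.
def back_substitute(R, v):
    m = len(v)
    x = [0.0] * m
    for j in range(m - 1, -1, -1):                        # j = m, ..., 1
        s = sum(R[j][i] * x[i] for i in range(j + 1, m))  # sum of r_{j,i} x_i
        x[j] = (v[j] - s) / R[j][j]
    return x

R = [[2.0, 1.0, 0.5],
     [0.0, 1.0, 1.0],
     [0.0, 0.0, 4.0]]
v = [6.0, 5.0, 8.0]            # v = R x for x = [1, 3, 2]
print(back_substitute(R, v))   # [1.0, 3.0, 2.0]
```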

According to (14), it is clear that R−1v can be computed directly from the components of R without computing R−1. Additionally, it can be implemented by the upper-triangular part of the systolic array, where the matrix R has already been stored. As shown in Fig. 11(b), the vector v = QHy enters the array from the right, and x = R−1v is computed by the triangular array with the cell operations shown in Fig. 12(b). The output vector x is then rounded element-wise outside the systolic array. The final step consists of multiplying the quantized vector xq by the unimodular matrix T, which is also stored in the array. Similar to the first step of a linear detection, this is a matrix–vector multiplication between T and xq. Hence, the data flow in Fig. 11(c) is the same as in Fig. 11(a). The cell


Fig. 11. The linear detection operations in the systolic array. (a) v = QHy, (b) x = R−1v, (c) xLR = Q(T · xq).


Fig. 12. The detailed operations of the diagonal cells and off-diagonal cells in the systolic array at different stages. (a) QHy and T · xq, (b) R−1v.

operations for T · xq are shown in Fig. 12(a), and the array output, quantized to the closest constellation point, is the final result xLR of the linear LRAD.

B. Spatial-Interference Cancellation in Systolic Array

The successive spatial-interference cancellation (SIC) can also be performed on this systolic array with some modifications to the PEs. Observing the first step of LR-aided SIC shown in (7), it should be apparent that QHy can be computed in the systolic array in the same fashion as in Fig. 11(a) and Fig. 12(a). The second step (8) of LR-aided SIC can be done in the systolic array as shown in Fig. 13. The operations are almost the same as those in Fig. 12(b), except that we have to perform a rounding in the off-diagonal cells Oij

at the super-diagonal position (j = i + 1). The rounding operations produce the decision for each zi. Similar to the linear LRAD, the final step of LR-aided SIC is to multiply z by the unimodular matrix T and bound all the output within the QAM constellation. It can be done in the same way as in Fig. 11(c) and Fig. 12(a), with xq replaced by z.
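The rounding-inside-back-substitution of this second step can be sketched as follows; this is our own illustration of the layer-by-layer decisions zi, not the fixed-point PE logic:

```python
# LR-aided SIC, second step: like back substitution for R z = v, but each
# layer's result is rounded to an integer decision before it is fed back
# to cancel the spatial interference in the layers above it.
def sic_decide(R, v):
    m = len(v)
    z = [0] * m
    for j in range(m - 1, -1, -1):
        s = sum(R[j][i] * z[i] for i in range(j + 1, m))
        z[j] = round((v[j] - s) / R[j][j])   # decision for layer j
    return z

R = [[2.0, 1.0], [0.0, 1.0]]
v = [4.9, 2.1]           # noisy observation of R z with z = [1, 2]
print(sic_decide(R, v))  # [1, 2]
```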

Notice that lattice reduction and linear detection (or SIC) are performed in the same systolic array, and it is hardware-efficient to share the adders/multipliers/dividers designed for the lattice reduction processing. For instance, there is one addition, one multiplication, and one division in each diagonal cell, and one addition and one multiplication in each off-diagonal cell for linear detection or SIC, be it ZF or MMSE. These operations are also contained in each cell at the LLL lattice reduction stage. For SIC, it seems that we need extra rounding operations in those off-diagonal cells at the super-diagonal position. However, those rounding operations are needed in the off-diagonal cells during the full size reduction processing as



Fig. 13. The data flow and the detailed operations of the cells in the systolic array for the interference-cancellation step of LR-aided SIC.

well. Hence, no extra hardware cost (adders or multipliers) is incurred in each cell for linear detection. Only extra control logic for the array is needed in order to have each PE work correctly in the different modes.

VI. CONCLUSION

In this paper, we have described a systolic array performing LLL-based lattice-reduction-aided detection for MIMO receivers. Lattice reduction and the ensuing linear detection or successive spatial-interference cancellation can be executed by the same array, with minimum global access to each processing element. The proposed systolic array with an external logic controller can work with two different lattice-reduction algorithms. One is the LLL algorithm with full size reduction, which is a different form of the conventional LLL algorithm and more suitable for parallel processing. The second is an all-swap complex lattice-reduction algorithm, which generalizes the one originally proposed in [30] for real lattices. Compared to FSR-LLL, ASLR operates on a whole matrix, rather than on its single columns, during the column-swap and Givens-rotation steps. To reduce the complexity of data communications between processing elements in the systolic array, we replace the Lovász condition in the LLL algorithm by the Siegel condition. Even though the Siegel condition is weaker than the Lovász condition, the BER performance of LR-aided linear detections based on our two algorithm versions appears to be as good as with the conventional LLL, and the computational complexity is reduced by the relaxation as well. Based on BER-performance and time-efficiency comparisons, ASLR should be preferred to FSR-LLL, especially for a MIMO system with a large number of antennas. The FPGA implementation results also show that our proposed systolic architecture for lattice reduction algorithms runs about 1.6× faster than the conventional LLL, at the cost of a moderate increase in hardware complexity. Additionally, due to the high-throughput property of systolic arrays, our design appears very promising for high-data-rate systems, such as MIMO-OFDM systems.

REFERENCES

[1] G. J. Foschini and M. J. Gans, "On limits of wireless communications in a fading environment when using multiple antennas," Wireless Personal Communications, vol. 6, pp. 311–335, 1998.

[2] E. Biglieri, R. Calderbank, A. Constantinides, A. Goldsmith, A. Paulraj, and H. V. Poor, MIMO Wireless Communications. New York, NY, USA: Cambridge University Press, 2007.

[3] Z. Guo and P. Nilsson, "A VLSI implementation of MIMO detection for future wireless communications," in Proc. IEEE Personal, Indoor and Mobile Radio Communications, vol. 3, 2003, pp. 2852–2856.

[4] M. Myllyla, J. Hintikka, J. Cavallaro, M. Juntti, M. Limingoja, and A. Byman, "Complexity analysis of MMSE detector architectures for MIMO OFDM systems," in Proc. the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, 2005, pp. 75–81.

[5] M. Karkooti, J. Cavallaro, and C. Dick, "FPGA implementation of matrix inversion using QRD-RLS algorithm," in Proc. Asilomar Conference on Signals, Systems and Computers, 2005, pp. 1625–1629.

[6] H. Yao and G. Wornell, "Lattice-reduction-aided detectors for MIMO communication systems," in IEEE Global Telecommunications Conference, GLOBECOM, vol. 1, 2002, pp. 424–428.

[7] D. Seethaler, G. Matz, and F. Hlawatsch, "Low-complexity MIMO data detection using Seysen's lattice reduction algorithm," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, vol. 3, 2007, pp. III-53–III-56.

[8] D. Wübben, R. Böhnke, V. Kühn, and K.-D. Kammeyer, "Near-maximum-likelihood detection of MIMO systems using MMSE-based lattice reduction," in Proc. IEEE International Conference on Communications, vol. 2, 2004, pp. 798–802.

[9] A. K. Lenstra, H. W. Lenstra, and L. Lovász, "Factoring polynomials with rational coefficients," Mathematische Annalen, vol. 261, no. 4, pp. 515–534, 1982.

[10] Y. H. Gan, C. Ling, and W. H. Mow, "Complex lattice reduction algorithm for low-complexity full-diversity MIMO detection," IEEE Trans. on Signal Processing, vol. 57, no. 7, pp. 2701–2710, 2009.

[11] X. Ma and W. Zhang, "Performance analysis for MIMO systems with lattice-reduction aided linear equalization," IEEE Trans. on Communications, vol. 56, no. 2, pp. 309–318, 2008.

[12] M. Taherzadeh, A. Mobasher, and A. Khandani, "LLL reduction achieves the receive diversity in MIMO decoding," IEEE Trans. on Inform. Theory, vol. 53, no. 12, pp. 4801–4805, 2007.

[13] J. Jaldén, D. Seethaler, and G. Matz, "Worst- and average-case complexity of LLL lattice reduction in MIMO wireless systems," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2008, pp. 2685–2688.

[14] B. Gestner, W. Zhang, X. Ma, and D. Anderson, "VLSI implementation of a lattice reduction algorithm for low-complexity equalization," in Proc. IEEE International Conference on Circuits and Systems for Communications, ICCSC, 2008, pp. 643–647.

[15] C. P. Schnorr and M. Euchner, "Lattice basis reduction: Improved practical algorithms and solving subset sum problems," Mathematical Programming, vol. 66, no. 1–3, pp. 181–199, 1994.

[16] J. Jaldén and P. Elia, "DMT optimality of LR-aided linear decoders for a general class of channels, lattice designs, and system models," IEEE Trans. on Information Theory, vol. 56, no. 10, pp. 4765–4780, 2010.

[17] H. Vetter, V. Ponnampalam, M. Sandell, and P. Hoeher, "Fixed complexity LLL algorithm," IEEE Trans. on Signal Processing, vol. 57, no. 4, pp. 1634–1637, 2009.

[18] H. T. Kung and C. E. Leiserson, "Algorithms for VLSI processor arrays," in Introduction to VLSI Systems. Addison-Wesley, 1980, p. 271.

[19] S. Y. Kung, "VLSI array processors," IEEE ASSP Magazine, vol. 2, no. 3, pp. 4–22, 1985.

[20] W. M. Gentleman and H. T. Kung, "Matrix triangulation by systolic arrays," in Proc. of SPIE: Real-time Signal Processing IV, vol. 298, 1981, pp. 19–26.

[21] A. El-Amawy and K. Dharmarajan, "Parallel VLSI algorithm for stable inversion of dense matrices," IEE Proc. Computers and Digital Techniques, vol. 136, no. 6, pp. 575–580, 1989.

[22] C. Rader, "VLSI systolic arrays for adaptive nulling," IEEE Signal Processing Magazine, vol. 13, no. 4, pp. 29–49, 1996.

[23] K. Liu, S.-F. Hsieh, K. Yao, and C.-T. Chiu, "Dynamic range, stability, and fault-tolerant capability of finite-precision RLS systolic array based on Givens rotations," IEEE Trans. on Circuits and Systems, vol. 38, no. 6, pp. 625–636, 1991.

[24] D. Boppana, K. Dhanoa, and J. Kempa, "FPGA based embedded processing architecture for the QRD-RLS algorithm," in Proc. IEEE Symposium on Field-Programmable Custom Computing Machines, 2004, pp. 330–331.

[25] K. Yao and F. Lorenzelli, "Systolic algorithms and architectures for high-throughput processing applications," Journal of Signal Processing Systems, vol. 53, no. 1–2, pp. 15–34, 2008.

[26] J. Wang and B. Daneshrad, "A universal systolic array for linear MIMO detections," in Proc. IEEE Wireless Communications and Networking Conference, 2008, pp. 147–152.

[27] K. Seki, T. Kobori, J. Okello, and M. Ikekawa, "A CORDIC-based reconfigurable systolic array processor for MIMO-OFDM wireless communications," in IEEE Workshop on Signal Processing Systems, 2007, pp. 639–644.

[28] Y. Hu, "CORDIC-based VLSI architectures for digital signal processing," IEEE Signal Processing Magazine, vol. 9, no. 3, pp. 16–35, 1992.

[29] B. Cerato, G. Masera, and P. Nilsson, "Hardware architecture for matrix factorization in MIMO receivers," in Proc. ACM Great Lakes Symposium on VLSI, New York, NY, USA, 2007, pp. 196–199.

[30] C. Heckler and L. Thiele, "A parallel lattice basis reduction for mesh-connected processor arrays and parallel complexity," in Proc. IEEE Symposium on Parallel and Distributed Processing, 1993, pp. 400–407.

[31] J. W. S. Cassels, Rational Quadratic Forms. London; New York: Academic Press, 1978.

[32] B. Hassibi, "An efficient square-root algorithm for BLAST," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, 2000, pp. II-737–II-740.

[33] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, "Closest point search in lattices," IEEE Trans. on Inform. Theory, vol. 48, no. 8, pp. 2201–2214, 2002.

[34] L. Babai, "On Lovász' lattice reduction and the nearest lattice point problem," Combinatorica, vol. 6, no. 1, pp. 1–13, 1986.

[35] R. Döhler, "Squared Givens rotation," IMA Journal of Numerical Analysis, vol. 11, no. 1, pp. 1–5, Jan. 1991.

[36] P. Luethi, A. Burg, S. Haene, D. Perels, N. Felber, and W. Fichtner, "VLSI implementation of a high-speed iterative sorted MMSE QR decomposition," in Proc. IEEE International Symposium on Circuits and Systems, 2007, pp. 1421–1424.

[37] C. V. Ramamoorthy, J. R. Goodman, and K. H. Kim, "Some properties of iterative square-rooting methods using high-speed multiplication," IEEE Trans. on Computers, vol. C-21, no. 8, pp. 837–847, 1972.

[38] F. Lorenzelli, P. Hansen, T. Chan, and K. Yao, "A systolic implementation of the Chan/Foster RRQR algorithm," IEEE Trans. on Signal Processing, vol. 42, no. 8, pp. 2205–2208, 1994.