
Approximate Neumann Series or Exact Matrix Inversion for Massive MIMO?

Oscar Gustafsson, Erik Bertilsson, Johannes Klasson, and Carl Ingemarsson

Matrix Inversion for Massive MIMO, Oscar Gustafsson, July 25, 2017

Matrix Inversion in Massive MIMO

• N terminals, M antennas

• Channel matrix, H ∈ ℂ^{M×N}

• Gram matrix, X = HᴴH ∈ ℂ^{N×N}, to be inverted for zero forcing (or MMSE)

• X: conjugate symmetric (Hermitian) and positive semi-definite

• X: with uncorrelated channels and M ≫ N, diagonally dominant
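The properties listed above can be checked numerically. The following is a minimal sketch, assuming i.i.d. unit-variance complex Gaussian channel entries and illustrative sizes (M = 64, N = 4). The helper names and the `dominance_ratio` metric (mean diagonal magnitude over mean off-diagonal magnitude) are assumptions made for this example, not definitions from the slides.

```python
import random

def random_channel(M, N, rng):
    """M x N matrix of i.i.d. complex Gaussian entries with unit variance (assumed model)."""
    s = 0.5 ** 0.5
    return [[complex(rng.gauss(0, s), rng.gauss(0, s)) for _ in range(N)]
            for _ in range(M)]

def gram(H):
    """Return the N x N Gram matrix X = H^H H."""
    M, N = len(H), len(H[0])
    return [[sum(H[m][i].conjugate() * H[m][j] for m in range(M))
             for j in range(N)] for i in range(N)]

def dominance_ratio(X):
    """Mean diagonal magnitude divided by mean off-diagonal magnitude (illustrative metric)."""
    N = len(X)
    diag = sum(abs(X[i][i]) for i in range(N)) / N
    off = sum(abs(X[i][j]) for i in range(N) for j in range(N) if i != j) / (N * (N - 1))
    return diag / off

rng = random.Random(0)
H = random_channel(64, 4, rng)   # M = 64 antennas, N = 4 terminals (illustrative sizes)
X = gram(H)
ratio = dominance_ratio(X)       # grows roughly like sqrt(M) for uncorrelated channels
```

The diagonal entries grow like M while the off-diagonal magnitudes grow like √M, which is why a large M/N ratio pushes X towards diagonal dominance.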

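[Figure: frame structure of duration T_frame with slots labeled P (pilot), UL (uplink), DL (downlink), and G; segment lengths N_UL,1, N_UL,2, and N_DL]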

• One matrix inversion per frame

• Computed between reception of pilot and transmission of first downlink data

• Latency, not throughput, is the critical metric


Algorithms for Matrix Inversion

• Exact algorithms

  • Numerical issues, especially in fixed-point, for close to singular (sub-)matrices

  • Division and/or square-roots

  • Cubic complexity

• LDLᵀ-decomposition

  • Lowest operation count

  • Reasonable fixed-point properties

  • No square-roots
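To make the exact approach concrete, here is a textbook-style sketch of inversion through an LDL factorization followed by triangular solves. It is not the authors' operation schedule. It assumes a Hermitian positive definite input and therefore uses the conjugate transpose (X = L D Lᴴ), and the function names are hypothetical. Note that only N reciprocals and no square roots appear.

```python
def ldl_decompose(X):
    """Factor Hermitian X as L D L^H; return unit lower-triangular L and 1/d_j."""
    n = len(X)
    L = [[0j] * n for _ in range(n)]
    d = [0.0] * n
    inv_d = [0.0] * n
    for j in range(n):
        # d_j = x_jj - sum_k |l_jk|^2 d_k (real for Hermitian X)
        d[j] = (X[j][j] - sum(L[j][k] * L[j][k].conjugate() * d[k]
                              for k in range(j))).real
        inv_d[j] = 1.0 / d[j]          # the only reciprocal for column j
        L[j][j] = 1 + 0j
        for i in range(j + 1, n):
            s = X[i][j] - sum(L[i][k] * L[j][k].conjugate() * d[k]
                              for k in range(j))
            L[i][j] = s * inv_d[j]
    return L, inv_d

def ldl_inverse(X):
    """Invert X column by column: solve L z = e_c, scale by D^-1, solve L^H w = y."""
    n = len(X)
    L, inv_d = ldl_decompose(X)
    inv = [[0j] * n for _ in range(n)]
    for c in range(n):
        z = [0j] * n
        for i in range(n):                       # forward substitution
            z[i] = (1.0 if i == c else 0.0) - sum(L[i][k] * z[k] for k in range(i))
        y = [z[i] * inv_d[i] for i in range(n)]  # diagonal scaling, no division
        w = [0j] * n
        for i in reversed(range(n)):             # back substitution with L^H
            w[i] = y[i] - sum(L[k][i].conjugate() * w[k] for k in range(i + 1, n))
        for i in range(n):
            inv[i][c] = w[i]
    return inv

X = [[4, 1 + 1j], [1 - 1j, 3]]   # small Hermitian example, determinant 10
X_inv = ldl_inverse(X)
```

A production fixed-point implementation would exploit the Hermitian symmetry of the result and reuse intermediate products. This sketch favors readability over operation count.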


• Neumann series expansion

  • Precondition matrix A ≈ X⁻¹

\hat{X}^{-1}_K = \left( \sum_{n=1}^{K} (I - AX)^{n-1} \right) A \qquad (1)

  • “High parallelism”

  • “Low complexity”

  • “No division”

  • “Numerically stable”
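Equation (1) can be evaluated directly by accumulating powers of I − AX. The sketch below is a plain, unoptimized illustration with hypothetical helper names. The 2 × 2 matrix and the diagonal choice of A (a_ii = 1/x_ii) are example inputs, not data from the slides.

```python
def matmul(A, B):
    """Dense matrix product of two list-of-lists matrices."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def neumann_inverse(X, A, K):
    """K-term Neumann approximation (sum_{n=1}^{K} (I - A X)^(n-1)) A."""
    n = len(X)
    AX = matmul(A, X)
    Y = [[(1.0 if i == j else 0.0) - AX[i][j] for j in range(n)] for i in range(n)]
    power = identity(n)   # (I - AX)^0
    total = identity(n)   # running sum, starting with the n = 1 term
    for _ in range(K - 1):
        power = matmul(power, Y)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return matmul(total, A)

X = [[4.0, 1.0], [1.0, 3.0]]
A = [[1 / 4.0, 0.0], [0.0, 1 / 3.0]]   # diagonal preconditioner, a_ii = 1/x_ii
approx3 = neumann_inverse(X, A, 3)     # K = 3 terms
approx40 = neumann_inverse(X, A, 40)   # many terms, converges towards X^-1
```

With K = 1 the result is A itself. Each additional term costs one more matrix product, which is where the cubic complexity for K ≥ 3 comes from.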


Diagonal precondition matrix

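A = \begin{bmatrix}
a_{1,1} & 0 & \cdots & 0 \\
0 & a_{2,2} & \ddots & 0 \\
\vdots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & a_{N,N}
\end{bmatrix},
\qquad a_{i,i} = 1/x_{i,i}

I - AX = \begin{bmatrix}
0 & y_{1,2} & \cdots & y_{1,N} \\
y_{2,1} & 0 & \ddots & y_{2,N} \\
\vdots & \ddots & \ddots & \vdots \\
y_{N,1} & y_{N,2} & \cdots & 0
\end{bmatrix}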
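The zero diagonal of I − AX follows directly from the choice a_{i,i} = 1/x_{i,i}. Written out entry by entry, with δ_{i,j} denoting the Kronecker delta (notation introduced here for the derivation):

```latex
(I - AX)_{i,j} = \delta_{i,j} - a_{i,i}\, x_{i,j}
               = \delta_{i,j} - \frac{x_{i,j}}{x_{i,i}}
\quad\Longrightarrow\quad
y_{i,i} = 0, \qquad
y_{i,j} = -\frac{x_{i,j}}{x_{i,i}} \quad (i \neq j)
```

When X is strongly diagonally dominant, every |y_{i,j}| is small, so the powers of I − AX in the series decay quickly.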


Tri-diagonal precondition matrix

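A = \begin{bmatrix}
a_{1,1} & a_{1,2} & 0 & \cdots & 0 \\
a_{2,1} & a_{2,2} & a_{2,3} & \ddots & 0 \\
0 & a_{3,2} & a_{3,3} & \ddots & 0 \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & a_{N,N}
\end{bmatrix}

• Sequential computation of A

• Generic I − AX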


Diagonal + column precondition matrix

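A = \begin{bmatrix}
a_{1,1} & 0 & \cdots & 0 \\
a_{2,1} & a_{2,2} & \ddots & 0 \\
\vdots & \ddots & \ddots & \vdots \\
a_{N,1} & 0 & \cdots & a_{N,N}
\end{bmatrix}

I - AX = \begin{bmatrix}
0 & y_{1,2} & \cdots & y_{1,N} \\
0 & y_{2,2} & \ddots & y_{2,N} \\
\vdots & \ddots & \ddots & \vdots \\
0 & y_{N,2} & \cdots & y_{N,N}
\end{bmatrix}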


Computational Complexity

• The latency (time to obtain the result) of an algorithm depends on two aspects:

  • Total number of operations → latency scales with number of processing elements (PEs)

  • Number of sequential operations → latency does not scale with number of PEs

• Pipelining of the PEs

  • Increases clock frequency

  • Increases latency

Computational Complexity Example

4 × 4 exact matrix inversion based on LDLᵀ

How Many Cycles?

• Assume multiply-and-add (MAD) operations

• Reciprocals performed using Newton-Raphson → a number of sequential MAD operations

• Sum-of-products computed using sequential MAD operations

• O operations, each with P pipeline stages, implemented on Q processing elements (PEs) require

C_{\mathrm{alg}} \geq \max\left\{ \left\lceil \frac{O}{Q} \right\rceil + P - 1,\; P\, C_{\mathrm{latency}} \right\} \text{ cycles.} \qquad (2)
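The bound in (2) is simple to evaluate. The sketch below takes C_latency to be the number of sequential operations on the critical path, which is an interpretation rather than a definition given on the slide. The example numbers (O = 48 and C_latency = 24 for a 4 × 4 matrix) assume the operation counts of the exact method with three sequential MADs per reciprocal.

```python
def cycle_lower_bound(O, Q, P, C_latency):
    """Lower bound on cycles from equation (2).

    O: total operations, Q: processing elements, P: pipeline stages,
    C_latency: sequential operations on the critical path (assumed meaning).
    """
    work_bound = (O + Q - 1) // Q + P - 1   # ceil(O / Q) + P - 1
    path_bound = P * C_latency              # critical path stretched by pipelining
    return max(work_bound, path_bound)

# 4 x 4 exact inversion, assumed counts: O = 48, C_latency = 24
bounds_vs_pes = [cycle_lower_bound(48, Q, 1, 24) for Q in (1, 2, 3, 4)]
bounds_vs_pipeline = [cycle_lower_bound(48, 1, P, 24) for P in (1, 2, 3, 4)]
```

With one PE the total work dominates. As PEs are added, the critical path takes over and additional hardware no longer reduces the bound.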


Algorithm Comparison – Complexity

Method                      MADs                        Reciprocals

Exact method
  LDLᵀ+EQU                  ½N³ + ½N² − N               N

Neumann series
  Diagonal, K = 2           N² − N                      N
  Diagonal, K = 3           ½N³ + N² − 1
  Tri-diagonal, K = 2       3N² + 7N − 10               2N − 1
  Tri-diagonal, K = 3       ½N³ + 6N² + ½N − 2          2N − 1
  Diag. + column, K = 2     (3/2)N² + (5/2)N − 4        N
  Diag. + column, K = 3     ½N³ + (5/2)N² − 2N − 1      N

Algorithm Comparison – Latency

Method                      MADs        Reciprocals

Exact method
  LDLᵀ+EQU                  4N − 4      N

Neumann series
  Diagonal, K = 2           2           1
  Diagonal, K = 3           N + 1       1
  Tri-diagonal, K = 2       2N + 5      N
  Tri-diagonal, K = 3       3N + 5      N
  Diag. + column, K = 2     N + 2       1
  Diag. + column, K = 3     2N + 1      1

Results

Bit-error rate for the four approaches, N = 20, M = 120

[Figure: bit-error rate curves; legend: Diagonal, Column + Diagonal, Tridiagonal, LDL]

Results

Reciprocal ⇒ three sequential MAD operations

4 × 4 matrix, schedule length in cycles:

#PE: 1, latency: 48

#PE: 2, latency: 29

#PE: 3, latency: 26

#PE: 4, latency: 25

Results – 16 × 16

Solid: actual result, dashed: from equation

[Figure: x-axis: processing elements; curves: Tri-diagonal, Col. + Diag., Diagonal, Exact]

Results – 8 × 8

Solid: actual result, dashed: from equation

[Figure: x-axis: processing elements; curves: Col. + Diag., Diagonal, Exact]

Results

With P = 1, 2, 3, 4 levels of pipelining, 4 × 4 matrix, schedule length in cycles:

P: 1, latency: 48

P: 2, latency: 57

P: 3, latency: 77

P: 4, latency: 98

Results – 16 × 16

Time in single cycle latency operations, assuming pipelining increases speed linearly

Solid: P = 1, dashed: P = 2, dash-dotted: P = 3

[Figure: x-axis: processing elements; curves: Col. + Diag., Diagonal, Exact]

Results – 8 × 8

Time in single cycle latency operations, assuming pipelining increases speed linearly

Solid: P = 1, dashed: P = 2, dash-dotted: P = 3

[Figure: x-axis: processing elements; curves: Col. + Diag., Diagonal, Exact]

Design Example

• Assume a latency requirement of 0.05 ms (10% of an LTE-like frame with 2 UL and 2 DL symbols)

• For N = 8 and one PE, 304 cycles are required for the exact algorithm

• One PE operating at f_clk = 6.08 MHz

• N = 30 ⇒ f_clk ≈ 280 MHz

• 2000 inversions per second (2 kInv/s), idle 90% of the time
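The clock frequencies above follow from dividing the cycle count by the latency budget. The sketch below reproduces that arithmetic, assuming the exact method needs ½N³ + ½N² − N MADs plus N reciprocals at three MADs each, all executed on a single non-pipelined PE. The function names are hypothetical.

```python
def exact_cycles(N, mads_per_reciprocal=3):
    """Cycles for the exact method on one PE (assumed operation counts)."""
    mads = (N**3 + N**2) // 2 - N          # N^3 + N^2 is always even
    return mads + N * mads_per_reciprocal  # N reciprocals, each a few MADs

def required_clock_hz(cycles, latency_s):
    """Clock frequency needed to finish `cycles` within `latency_s` seconds."""
    return cycles / latency_s

latency_budget = 0.05e-3                   # 0.05 ms, i.e. 10% of a 0.5 ms frame

f_n8 = required_clock_hz(exact_cycles(8), latency_budget)    # about 6.08 MHz
f_n30 = required_clock_hz(exact_cycles(30), latency_budget)  # about 280 MHz
inversions_per_second = 1 / 0.5e-3         # one inversion per 0.5 ms frame
```

Since the count is dominated by the ½N³ term, going from N = 8 to N = 30 raises the required clock by a factor of roughly 46.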


Is Neumann useful at all?

• If fewer than three terms are used, the complexity may be lower

• Only compute parts of the third iteration

• Allow increasing the number of terminals further

• But numerically most efficient when the ratio between number of antennas and terminals is high

• May give a better result with singular or close to singular matrices (the result is not correct, but may not be as bad as that of an exact algorithm)

• (Really) large matrices

Conclusions

• Latency, not throughput

• Complexity for Neumann series with K = 3 higher than best exact algorithm

• Few terms for Neumann when diagonally dominant

  • Diagonally dominant ⇒ well conditioned ⇒ exact algorithm behaves well

  • Few terminals ⇒ more diagonally dominant ⇒ fewer Neumann terms (but also less complexity for exact algorithm)

• With few PEs compared to matrix size, the limited parallelism of the exact algorithm is no problem

• Required latency/parallelism determined by frame structure

Thank you! Questions?

www.liu.se
