Approximate Neumann Series or Exact Matrix Inversion for Massive MIMO?
Oscar Gustafsson, Erik Bertilsson, Johannes Klasson, and Carl Ingemarsson
Matrix Inversion for Massive MIMO Oscar Gustafsson July 25, 2017 1
Matrix Inversion in Massive MIMO
• $N$ terminals, $M$ antennas
• Channel matrix $H \in \mathbb{C}^{M \times N}$
• Gram matrix $X = H^H H \in \mathbb{C}^{N \times N}$ to be inverted for zero forcing (or MMSE)
• $X$: conjugate symmetric (Hermitian) and positive semi-definite
• $X$: with uncorrelated channels and $M \gg N$, diagonally dominant
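The properties listed above can be checked numerically. The following is a minimal sketch, not taken from the slides: it assumes i.i.d. complex Gaussian channel entries, uses the sizes $N = 20$, $M = 120$ from the bit-error-rate slide, and the helper names (`gram_matrix`, `is_hermitian`, `dominance_ratios`) are hypothetical.

```python
import random

def gram_matrix(H):
    """Return X = H^H H for H given as M rows of N complex entries."""
    M, N = len(H), len(H[0])
    return [[sum(H[m][i].conjugate() * H[m][j] for m in range(M))
             for j in range(N)]
            for i in range(N)]

def is_hermitian(X, tol=1e-9):
    """Check x_ij == conj(x_ji) up to a tolerance."""
    n = len(X)
    return all(abs(X[i][j] - X[j][i].conjugate()) <= tol
               for i in range(n) for j in range(n))

def dominance_ratios(X):
    """Per-row ratio |x_ii| / sum_{j != i} |x_ij|; above 1 means the row is dominant."""
    n = len(X)
    return [abs(X[i][i]) / sum(abs(X[i][j]) for j in range(n) if j != i)
            for i in range(n)]

rng = random.Random(0)   # fixed seed, for repeatability only
M, N = 120, 20           # assumed sizes (antennas, terminals)
# Assumption: uncorrelated channels modelled as i.i.d. complex Gaussian entries.
H = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(N)]
     for _ in range(M)]
X = gram_matrix(H)
print("Hermitian:", is_hermitian(X))
print("smallest dominance ratio:", min(dominance_ratios(X)))
```

Whether every row ratio exceeds 1 depends on $M$, $N$, and the channel realization, so the sketch reports the ratio rather than asserting strict diagonal dominance.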
Matrix Inversion in Massive MIMO
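[Figure: frame structure of duration $T_{\mathrm{frame}}$, made up of a pilot (P), uplink (UL) symbols, downlink (DL) symbols, and guard intervals (G); segment lengths $N_{\mathrm{UL},1}$, $N_{\mathrm{UL},2}$, and $N_{\mathrm{DL}}$]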
• One matrix inversion per frame
• Computed between reception of pilot and
transmission of first downlink data
• Latency, not throughput
Algorithms for Matrix Inversion
• Exact algorithms
  • Numerical issues, especially in fixed-point, for close to singular (sub-)matrices
  • Division and/or square-roots
  • Cubic complexity
• LDLᵀ-decomposition
  • Lowest operation count
  • Reasonable fixed-point properties
  • No square-roots
Algorithms for Matrix Inversion
• Neumann series expansion
• Precondition matrix $A \approx X^{-1}$
\[
\hat{X}^{-1}_K = \left( \sum_{n=1}^{K} (I - AX)^{n-1} \right) A \tag{1}
\]
• “High parallelism”
• “Low complexity”
• “No division”
• “Numerically stable”
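Equation (1) translates almost line by line into code. The sketch below is a plain-Python illustration under stated assumptions: the function names are hypothetical, the 2 × 2 matrix `X` is a made-up diagonally dominant example, and `A` is chosen as the inverse of its diagonal.

```python
def matmul(A, B):
    """Dense matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def neumann_inverse(X, A, K):
    """Approximate X^{-1} by (sum_{n=1}^{K} (I - A X)^{n-1}) A, as in Eq. (1)."""
    n = len(X)
    AX = matmul(A, X)
    E = [[(1.0 if i == j else 0.0) - AX[i][j] for j in range(n)]
         for i in range(n)]
    term = identity(n)    # (I - AX)^0, the n = 1 term
    total = identity(n)   # running sum of the powers
    for _ in range(K - 1):
        term = matmul(term, E)
        total = [[total[i][j] + term[i][j] for j in range(n)]
                 for i in range(n)]
    return matmul(total, A)

def max_abs_diff(P, Q):
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))

X = [[4.0, 1.0], [1.0, 3.0]]                # hypothetical example matrix
A = [[1.0 / 4.0, 0.0], [0.0, 1.0 / 3.0]]    # assumed preconditioner: 1 / x_ii
X_inv = [[3.0 / 11.0, -1.0 / 11.0], [-1.0 / 11.0, 4.0 / 11.0]]  # exact inverse
errors = [max_abs_diff(neumann_inverse(X, A, K), X_inv) for K in (1, 2, 3)]
print(errors)
```

For this example the error shrinks with each added term, because $\hat{X}^{-1}_K = (I - (I - AX)^K) X^{-1}$ and the powers of $I - AX$ decay.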
Algorithms for Matrix Inversion
Diagonal precondition matrix
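\[
A = \begin{pmatrix}
a_{1,1} & 0 & \cdots & 0 \\
0 & a_{2,2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & a_{N,N}
\end{pmatrix},
\qquad a_{i,i} = 1/x_{i,i}
\]
\[
I - AX = \begin{pmatrix}
0 & y_{1,2} & \cdots & y_{1,N} \\
y_{2,1} & 0 & \cdots & y_{2,N} \\
\vdots & \vdots & \ddots & \vdots \\
y_{N,1} & y_{N,2} & \cdots & 0
\end{pmatrix}
\]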
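As a short worked step (standard algebra, not spelled out on the slide), the entries $y_{i,j}$ follow directly from $a_{i,i} = 1/x_{i,i}$, and the row sums show why diagonal dominance matters for convergence:

```latex
\[
(AX)_{i,j} = a_{i,i}\, x_{i,j} = \frac{x_{i,j}}{x_{i,i}}
\quad\Longrightarrow\quad
y_{i,j} = \delta_{i,j} - \frac{x_{i,j}}{x_{i,i}} =
\begin{cases}
0, & i = j \\
-\,x_{i,j} / x_{i,i}, & i \neq j
\end{cases}
\]
\[
\sum_{j} |y_{i,j}| = \frac{\sum_{j \neq i} |x_{i,j}|}{|x_{i,i}|} < 1
\quad \text{for every row } i \text{ if } X \text{ is strictly diagonally dominant}
\]
```

In that case $\| I - AX \|_\infty < 1$, so the powers $(I - AX)^{n-1}$ in Eq. (1) decay and the series converges as $K$ grows.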
Algorithms for Matrix Inversion
Tri-diagonal precondition matrix
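\[
A = \begin{pmatrix}
a_{1,1} & a_{1,2} & 0 & \cdots & 0 \\
a_{2,1} & a_{2,2} & a_{2,3} & \cdots & 0 \\
0 & a_{3,2} & a_{3,3} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & a_{N,N}
\end{pmatrix}
\]
Sequential computation of $A$
Generic $I - AX$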
Algorithms for Matrix Inversion
Diagonal + column precondition matrix
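\[
A = \begin{pmatrix}
a_{1,1} & 0 & \cdots & 0 \\
a_{2,1} & a_{2,2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
a_{N,1} & 0 & \cdots & a_{N,N}
\end{pmatrix}
\]
\[
I - AX = \begin{pmatrix}
0 & y_{1,2} & \cdots & y_{1,N} \\
0 & y_{2,2} & \cdots & y_{2,N} \\
\vdots & \vdots & \ddots & \vdots \\
0 & y_{N,2} & \cdots & y_{N,N}
\end{pmatrix}
\]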
Computational Complexity
• The latency (time to obtain the result) of an algorithm depends on two aspects:
  • Total number of operations → latency scales with number of processing elements (PEs)
  • Number of sequential operations → latency does not scale with number of PEs
• Pipelining of the PEs
  • Increases clock frequency
  • Increases latency
Computational Complexity Example
[Figure: 4 × 4 exact matrix inversion based on LDLᵀ]
How Many Cycles?
• Assume multiply-and-add (MAD) operations
• Reciprocals performed using Newton-Raphson → a number of sequential MAD operations
• Sum-of-products computed using sequential MADs
• $O$ operations, each with $P$ pipeline stages, implemented on $Q$ processing elements (PEs) require
\[
C_{\mathrm{alg}} \ge \max\left\{ \left\lceil \frac{O}{Q} \right\rceil + P - 1,\; P\, C_{\mathrm{latency}} \right\} \text{ cycles.} \tag{2}
\]
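The bound in Eq. (2) is easy to evaluate for a given configuration. The sketch below is a minimal illustration; the function name `min_cycles` is hypothetical, and reading $C_{\mathrm{latency}}$ as the number of sequential (dependent) operations is an assumption, since the slide does not define it. The example values $O = 48$ and $C_{\mathrm{latency}} = 24$ are illustrative.

```python
def min_cycles(num_ops, num_pes, pipeline_stages, seq_ops):
    """Lower bound from Eq. (2): max(ceil(O / Q) + P - 1, P * C_latency)."""
    # Integer ceiling division avoids floating-point rounding.
    work_bound = (num_ops + num_pes - 1) // num_pes + pipeline_stages - 1
    # Assumption: seq_ops is the length of the longest dependency chain.
    chain_bound = pipeline_stages * seq_ops
    return max(work_bound, chain_bound)

# Illustrative values: 48 operations, 24 of them strictly sequential.
for q in (1, 2, 3, 4):
    print(q, min_cycles(48, q, 1, 24))
```

With one PE the total operation count dominates; with more PEs the bound is set by the sequential chain and stops improving, which is the point made on the "Computational Complexity" slide.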
Algorithm Comparison – Complexity
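Method                      MADs                                           Reciprocals
Exact method
  LDLᵀ+EQU                  $\frac{1}{2}N^3 + \frac{1}{2}N^2 - N$          $N$
Neumann series
  Diagonal, $K = 2$         $N^2 - N$                                      $N$
  Diagonal, $K = 3$         $\frac{1}{2}N^3 + N^2 - \frac{1}{2}N$          $N$
  Tri-diagonals, $K = 2$    $3N^2 + 7N - 10$                               $2N - 1$
  Tri-diagonals, $K = 3$    $\frac{1}{2}N^3 + 6N^2 + \frac{1}{2}N - 2$     $2N - 1$
  Diag. + column, $K = 2$   $\frac{3}{2}N^2 + \frac{5}{2}N - 4$            $N$
  Diag. + column, $K = 3$   $\frac{1}{2}N^3 + \frac{5}{2}N^2 - 2N - 1$     $N$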
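To compare the methods for a concrete $N$, the table entries can be evaluated directly. The following sketch is illustrative only: the function names and dictionary keys are hypothetical, and counting each reciprocal as three MADs follows the assumption stated on the Results slide ("Reciprocal ⇒ three sequential MAD operations").

```python
def mad_counts(n):
    """MAD counts from the complexity table, keyed by (method, K)."""
    return {
        ("exact", None): (n**3 + n**2 - 2 * n) // 2,
        ("diagonal", 2): n**2 - n,
        ("diagonal", 3): (n**3 + 2 * n**2 - n) // 2,
        ("tri-diagonal", 2): 3 * n**2 + 7 * n - 10,
        ("tri-diagonal", 3): (n**3 + 12 * n**2 + n - 4) // 2,
        ("diag+column", 2): (3 * n**2 + 5 * n - 8) // 2,
        ("diag+column", 3): (n**3 + 5 * n**2 - 4 * n - 2) // 2,
    }

def reciprocal_counts(n):
    """Reciprocal counts from the same table."""
    return {key: (2 * n - 1 if key[0] == "tri-diagonal" else n)
            for key in mad_counts(n)}

def total_ops(n, mads_per_reciprocal=3):
    """Total MAD-equivalent operations, assuming 3 MADs per reciprocal."""
    mads, recips = mad_counts(n), reciprocal_counts(n)
    return {key: mads[key] + mads_per_reciprocal * recips[key] for key in mads}

for key, ops in total_ops(4).items():
    print(key, ops)
```

For $N = 4$ and $N = 8$ the exact method comes out at 48 and 304 operations, which agrees with the single-PE latencies quoted elsewhere in the deck.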
Algorithm Comparison – Latency
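Method                      Sequential MADs    Sequential reciprocals
Exact method
  LDLᵀ+EQU                  $4N - 4$           $N$
Neumann series
  Diagonal, $K = 2$         $2$                $1$
  Diagonal, $K = 3$         $N + 1$            $1$
  Tri-diagonals, $K = 2$    $2N + 5$           $N$
  Tri-diagonals, $K = 3$    $3N + 5$           $N$
  Diag. + column, $K = 2$   $N + 2$            $1$
  Diag. + column, $K = 3$   $2N + 1$           $1$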
Results
Bit-error rate for the four approaches, $N = 20$, $M = 120$
[Figure: bit-error rate curves for Diagonal, Column + Diagonal, Tridiagonal, and LDL]
Results
Reciprocal ⇒ three sequential MAD operations
4 × 4 matrix
[Figure: number of operations per cycle, one panel per number of PEs]
• #PE: 1, latency: 48
• #PE: 2, latency: 29
• #PE: 3, latency: 26
• #PE: 4, latency: 25
Results – 16× 16
Solid: actual result, dashed: from equation
[Figure: cycles versus number of processing elements for Tri-diagonal, Col. + Diag., Diagonal, and Exact]
Results – 8× 8
Solid: actual result, dashed: from equation
[Figure: cycles versus number of processing elements for Col. + Diag., Diagonal, and Exact]
Results
With $P = 1, 2, 3, 4$ levels of pipelining, 4 × 4 matrix
[Figure: number of operations per cycle, one panel per pipelining level]
• P: 1, latency: 48
• P: 2, latency: 57
• P: 3, latency: 77
• P: 4, latency: 98
Results – 16× 16
Time in single cycle latency operations, assuming pipelining increases speed linearly
Solid: $P = 1$, dashed: $P = 2$, dash-dotted: $P = 3$
[Figure: time versus number of processing elements for Col. + Diag., Diagonal, and Exact]
Results – 8× 8
Time in single cycle latency operations, assuming pipelining increases speed linearly
Solid: $P = 1$, dashed: $P = 2$, dash-dotted: $P = 3$
[Figure: time versus number of processing elements for Col. + Diag., Diagonal, and Exact]
Design Example
• Assume a latency requirement of 0.05 ms (10% of an LTE-like frame with 2 UL and 2 DL symbols)
• For $N = 8$ and one PE, 304 cycles are required for the exact algorithm
• One PE operating at $f_{\mathrm{clk}} = 6.08$ MHz
• $N = 30 \Rightarrow f_{\mathrm{clk}} \approx 280$ MHz
• 2 kInv/s, idle 90% of the time
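The arithmetic behind these figures can be retraced as follows. This is a sketch under stated assumptions: a single PE without pipelining, the exact method's operation count $\frac{1}{2}N^3 + \frac{1}{2}N^2 - N$ MADs plus $N$ reciprocals, and three MADs per reciprocal; the function names are hypothetical.

```python
def exact_cycles_single_pe(n, mads_per_reciprocal=3):
    """Cycles for the exact method on one PE (assumes one operation per cycle)."""
    mads = (n**3 + n**2 - 2 * n) // 2
    return mads + mads_per_reciprocal * n

def required_clock_hz(cycles, latency_s):
    """Clock frequency needed to finish `cycles` within `latency_s` seconds."""
    return cycles / latency_s

LATENCY_S = 0.05e-3              # 0.05 ms latency requirement
FRAME_S = LATENCY_S / 0.10       # latency budget is 10% of the frame

for n in (8, 30):
    cycles = exact_cycles_single_pe(n)
    print(n, cycles, required_clock_hz(cycles, LATENCY_S) / 1e6, "MHz")

print("inversions per second:", 1.0 / FRAME_S)
```

With one inversion per 0.5 ms frame the rate is 2000 inversions per second, and since the inversion occupies only 10% of the frame, the PE is idle for the remaining 90%.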
Is Neumann useful at all?
• If fewer than three terms are used, the complexity may be lower
• Only compute parts of the third iteration
• Allow increasing the number of terminals further
• But numerically most efficient when the ratio between number of antennas and terminals is high
• May give a better result with singular or close to singular matrices (the result is not correct, but may not be as bad as with an exact algorithm)
• (Really) large matrices
Conclusions
• Latency, not throughput
• Complexity for Neumann series with $K = 3$ higher than best exact algorithm
• Few terms for Neumann when diagonally dominant
• Diagonally dominant ⇒ well conditioned ⇒ exact algorithm behaves well
• Few terminals ⇒ more diagonally dominant ⇒ fewer Neumann terms (but also less complexity for exact algorithm)
• With few PEs compared to matrix size, the limited parallelism of the exact algorithm is no problem
• Required latency/parallelism determined by frame structure