Massive Parallel LDPC Decoding on GPU
Gabriel Falcão, Leonel Sousa, Vitor Silva
University of Coimbra and Technical University of Lisbon, Portugal
PPoPP'08, Salt Lake City, February 21st, 2008
MOTIVATION
LDPC decoding:
- Intensive computation
- Irregular accesses to memory

LDPC decoding on dedicated VLSI hardware:
- Low area, low power consumption
- High throughputs (Mbps) and low latency
- Fixed-point arithmetic

LDPC decoding on GPUs:
- Abundant processing horsepower available
- CUDA programming interface
- Medium to high throughputs (Mbps)
- Floating-point arithmetic
- A software-based, flexible solution!
OUTLINE
- Motivation
- LDPC codes: Bit Node (BN) and Check Node (CN) processing
- GPUs and the CUDA interface
- Experimental results
- Conclusions and future work
LDPC CODES
Advantages:
- Linear block codes
- Perform close to the Shannon-limit capacity
- High throughputs (Mbps)
- Very low Bit Error Rate (BER)

Disadvantages:
- Good performance implies large H matrices
- Computationally intensive operations
- Large amounts of hardware; dedicated VLSI solutions are expensive

Bottom line: why not use the horsepower already available on GPUs, instead of developing expensive VLSI?
LDPC CODES
The parity-check matrix H defines the LDPC code.
The Tanner graph represents the connections between bit nodes (BNs) and check nodes (CNs).
[Figure: Tanner graph connecting bit nodes BN0..BN5 to check nodes CN0..CN2]

$$H = \begin{bmatrix} 1 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}$$

(row m of H corresponds to check node CNm; column n corresponds to bit node BNn)
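Each row of H encodes one parity-check equation over the codeword bits:

$$\bigoplus_{n \,:\, H_{mn} = 1} c_n = 0; \qquad \text{e.g., row 0 above imposes } c_0 \oplus c_1 \oplus c_4 = 0.$$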
LDPC DECODER
BNs and CNs exchange messages (i.e., probabilities), allowing a reliable decision on each bit value:
- $r_{mn}$: message sent from CNm to BNn
- $q_{nm}$: message sent from BNn to CNm
CHECK NODE PROCESSING - CN
1. Calculates the message going from CNm to BNn:

[Figure: check node CNm receiving the messages q_im, q_jm, q_km from bit nodes BNi, BNj, BNk, and sending r_mn to BNn]

$$r_{mn}^{(i)}(0) = \frac{1}{2} + \frac{1}{2}\prod_{n' \in N(m)\setminus n}\left(1 - 2\,q_{n'm}^{(i-1)}(1)\right), \qquad r_{mn}^{(i)}(1) = 1 - r_{mn}^{(i)}(0)$$

where $N(m)$ is the set of bit nodes connected to CNm.
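For instance, with illustrative incoming messages $q_{im}^{(i-1)}(1) = 0.2$ and $q_{jm}^{(i-1)}(1) = 0.3$ from the two other neighbours of CNm:

$$r_{mn}^{(i)}(0) = \tfrac{1}{2} + \tfrac{1}{2}\,(1 - 2\cdot 0.2)(1 - 2\cdot 0.3) = 0.5 + 0.5 \cdot 0.24 = 0.62$$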
BIT NODE PROCESSING – BN
2. Calculates the message sent from BNn to CNm, including the channel information $p_n$:

$$q_{nm}^{(i)}(0) = k_{nm}\,(1 - p_n)\prod_{m' \in M(n)\setminus m} r_{m'n}^{(i)}(0), \qquad q_{nm}^{(i)}(1) = k_{nm}\,p_n\prod_{m' \in M(n)\setminus m} r_{m'n}^{(i)}(1)$$

where $M(n)$ is the set of check nodes connected to BNn and $k_{nm}$ normalizes $q_{nm}^{(i)}(0) + q_{nm}^{(i)}(1) = 1$.

3. Then computes the a posteriori pseudo-probabilities and performs hard decoding:

$$Q_n^{(i)}(0) = k_n\,(1 - p_n)\prod_{m \in M(n)} r_{mn}^{(i)}(0), \qquad \hat{c}_n = \begin{cases}1 & \text{if } Q_n^{(i)}(1) > 0.5\\ 0 & \text{otherwise}\end{cases}$$

[Figure: bit node BNn receiving the channel information Pn and the messages r_in, r_jn, r_kn from check nodes CNi, CNj, CNk, and sending q_nm to CNm]
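As a small numeric illustration (values assumed): with channel probability $p_n = 0.4$ and incoming messages $r_{m'n}^{(i)}(0) = 0.62$ and $0.7$,

$$Q_n^{(i)}(0) \propto 0.6 \cdot 0.62 \cdot 0.7 = 0.260, \qquad Q_n^{(i)}(1) \propto 0.4 \cdot 0.38 \cdot 0.3 = 0.046,$$

so after normalization $Q_n^{(i)}(1) \approx 0.15 < 0.5$ and the bit is hard-decoded as $\hat{c}_n = 0$.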
INTENSIVE COMPUTING
"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?"-- Seymore Cray
GRAPHICS PROCESSING UNITS (GPUs)
- Raw compute power increasing rapidly
- Many-core architecture
- Can be programmed outside the graphics framework: a multi-threaded architecture exposed through CUDA
- Growing interest in general-purpose processing on GPUs (GPGPU)
- Programming is hard and needs an efficient interface
- The GPU wins when arithmetic intensity is maximized... and loses on memory accesses!
SUM PRODUCT ALGORITHM (SPA)
Kernel 1 - Horizontal Processing: computes the messages sent from CNm to BNn (the probability of BNn being 0 or 1):

$$r_{mn}^{(i)}(0) = \frac{1}{2} + \frac{1}{2}\prod_{n' \in N(m)\setminus n}\left(1 - 2\,q_{n'm}^{(i-1)}(1)\right), \qquad r_{mn}^{(i)}(1) = 1 - r_{mn}^{(i)}(0)$$

Kernel 2 - Vertical Processing: computes the messages sent from BNn to CNm:

$$q_{nm}^{(i)}(0) = k_{nm}\,(1 - p_n)\prod_{m' \in M(n)\setminus m} r_{m'n}^{(i)}(0), \qquad q_{nm}^{(i)}(1) = k_{nm}\,p_n\prod_{m' \in M(n)\setminus m} r_{m'n}^{(i)}(1)$$
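Below is a minimal CUDA sketch of kernel 1 (the check-node update). It assumes a regular code in which every row of H has the same number of nonzero entries and the q/r messages are stored in flat row-major arrays; the kernel name and layout parameters are illustrative, not the authors' exact implementation.

    // One thread per nonzero entry (edge) of H computes r_mn(0).
    // Assumed layout: q[row * edges_per_row + j] holds q_{n'm}(1), from the
    // previous iteration, for the j-th nonzero of row `row`.
    __global__ void kernel1_check_node(const float *q, float *r,
                                       int rows, int edges_per_row)
    {
        int edge = blockIdx.x * blockDim.x + threadIdx.x;
        if (edge >= rows * edges_per_row) return;

        int row  = edge / edges_per_row;   // check node m
        int self = edge % edges_per_row;   // position of bit node n in row m

        // Product over the other bit nodes n' in N(m) \ n
        float prod = 1.0f;
        for (int j = 0; j < edges_per_row; ++j)
            if (j != self)
                prod *= 1.0f - 2.0f * q[row * edges_per_row + j];

        r[edge] = 0.5f + 0.5f * prod;      // r_mn(0); r_mn(1) = 1 - r_mn(0)
    }

For irregular codes, the loop would instead follow the compact HBN/HCN structures described on the next slide.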
COMPACT DATA STRUCTURES – H MATRIX
H mapped into compact HBN and HCN data structures
$$H = \begin{bmatrix} 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \end{bmatrix}$$

(8 bit nodes checked by 4 check-node equations)
[Figure: HBN layout. The 12 nonzero entries of H, scanned in row-major order (r0,0, r0,1, r0,2, r1,3, r1,4, r1,5, r2,0, r2,3, r2,6, r3,1, r3,4, r3,7), are packed into consecutive words (word 1, word 2, ..., word n); each entry stores the address of the next nonzero element of its row, produced by the circular search below.]
Generation of HBN (the slides' pseudocode, rendered in C; e indexes the nonzero entries of the M x N matrix H in row-major order):

    int e = 0;
    for (int m = 0; m < M; m++)           /* rows of H: check nodes */
        for (int n = 0; n < N; n++)       /* columns of H: bit nodes */
            if (H[m][n] == 1) {
                int j = (n + 1) % N;      /* circular search for the next */
                while (H[m][j] != 1)      /* nonzero column j of row m,   */
                    j = (j + 1) % N;      /* with j in (n, n+N) mod N     */
                HBN[e++] = j;             /* p_next */
            }
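As a usage sketch of the resulting structure (process_message and edge_index are hypothetical helpers, not from the slides): starting at any nonzero column of row m and repeatedly following HBN visits every message of that row in circular order, without scanning the zeros of H.

    int j = n0;                          /* any column with H[m][n0] == 1 */
    do {
        process_message(m, j);           /* e.g., update the message r_{m,j} */
        j = HBN[edge_index(m, j)];       /* jump to the next nonzero of row m */
    } while (j != n0);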
COMPUTING KERNELS ON THE GPU
- A novel multi-thread computing approach: the SPA is iteratively performed by several kernels on the GPU
- Flow control and execution management of the kernels are handled by the CUDA programming interface
[Figure: SPA dataflow on the GPU. Starting from the channel probabilities p, kernel 1 produces the r messages and kernel 2 produces the q messages of the next iteration, both addressing data through the compact HBN and HCN structures; each kernel is executed by many threads (THREAD (0,0), THREAD (1,0), ...), and a SYNCHRONIZATION POINT separates consecutive kernel launches.]
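A hedged sketch of the corresponding host-side loop (kernel names, arguments and launch shapes are illustrative): launches issued to the same CUDA stream execute in order, which realizes the synchronization points between kernel 1 and kernel 2.

    // d_q, d_r, d_p are device buffers initialized from the channel data.
    for (int it = 0; it < max_iterations; ++it) {
        kernel1_check_node<<<grid1, threads1>>>(d_q, d_r, rows, edges_per_row);
        kernel2_bit_node  <<<grid2, threads2>>>(d_r, d_p, d_q, cols, edges_per_col);
    }
    // After the last iteration the a posteriori values Q are copied back
    // and hard decoding (Q(1) > 0.5) is done on the CPU.
    cudaMemcpy(h_Q, d_Q, n_bits * sizeof(float), cudaMemcpyDeviceToHost);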
CUDA INTERFACE FOR GPGPU
A C-based programming interface for NVIDIA's GeForce 8 series and next-generation GPUs.

CUDA enables efficient use of their massive parallelism:
- Multi-threading hides memory latency
- Allows (fairly) transparent programming
- Slow global memory vs. fast shared memory access
- Avoid non-coalesced memory accesses (see the sketch below)
- Significant speedups, depending on the algorithm
- Hard challenge: irregular memory access patterns!
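To make the coalescing point concrete, here is a small illustrative fragment (the function and array names are assumptions, and the launch is assumed not to exceed the array bounds): consecutive threads reading consecutive addresses are served by one wide memory transaction, while strided reads are split into many.

    __global__ void access_patterns(const float *data, float *out, int stride)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float a = data[tid];           // coalesced: thread k reads element k
        float b = data[tid * stride];  // non-coalesced whenever stride > 1
        out[tid] = a + b;              // consecutive writes also coalesce
    }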
MULTI-THREAD COMPUTING APPROACH
Multi-thread strategy and architecture
[Figure: CUDA execution and memory model. A grid of blocks, BLOCK (0,0) .. BLOCK (X,Y), runs on the GPU; each block contains threads, THREAD (0,0), THREAD (1,0), ..., with private registers and local memory; the threads of a block share a per-block shared memory, and all blocks access the global memory.]
MULTI-THREAD COMPUTING APPROACH
A circular addressing mechanism allows the degree of parallelism to be increased.
[Figure: thread-to-data mapping for the r messages. In BLOCK (0,0), THREADs (0,0)..(2,0) process r0,0, r0,1, r0,2; THREADs (0,1)..(2,1) process r1,3, r1,4, r1,5; THREADs (0,2)..(2,2) process r2,0, r2,3, r2,6; the remaining positions hold rNULL padding entries so that every block of the grid has the same shape. A kernel-side sketch of this padded mapping follows.]
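A brief sketch of how the padded mapping can be handled inside a kernel (the is_null mask, pitch layout and kernel name are assumptions for illustration): threads that land on an rNULL slot simply return, so every block keeps identical dimensions.

    __global__ void process_r(const float *r, const int *is_null,
                              float *out, int pitch)
    {
        int tx = blockIdx.x * blockDim.x + threadIdx.x;
        int ty = blockIdx.y * blockDim.y + threadIdx.y;
        int idx = ty * pitch + tx;        // pitch includes the rNULL padding
        if (is_null[idx]) return;         // padding slot: nothing to compute
        out[idx] = r[idx];                // placeholder for the real update
    }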
MULTI-THREAD COMPUTING APPROACH
[Figure: thread-to-data mapping for the q messages. THREADs (A)..(F) process q_n,m0, q_n,m1, q_n,m2 and q_n',m0, q_n',m1, q_n',m2; as before, the blocks of the grid are padded with qNULL entries.]
EXPERIMENTAL RESULTS
Decoding times for 25, 50 and 100 iterations (CPU vs. GPU):

Matrix size    25 iterations    50 iterations    100 iterations
               CPU     GPU      CPU     GPU      CPU      GPU
512x1024       3.5     0.2      6.9     0.4      13.9     0.8
2448x4896      16.7    0.8      33.3    1.6      66.5     3.1
2000x4000      21.0    1.1      41.9    2.2      84.0     4.2
Main conclusions (obtained with CUDA for the matrices we considered):
- Much faster processing than on top-notch CPUs
- Supports floating-point operations
- Achieves medium to large throughputs
- BUT MOST DEFINITELY NOT AS GREAT AS WE HOPED!
CONCLUSIONS AND FUTURE WORK
- A GPGPU approach for LDPC decoding: new compact data structures to represent the H matrix and a multi-thread algorithm for LDPC decoding
- Significant speedups achieved with the CUDA programming interface: up to 22x
- GPUs allow a software-based, scalable and low-cost solution

Future work:
- Trading task parallelism for data parallelism
- Adoption/generalization of the proposed approach (algorithms and data structures) for irregular processing on graphs
CONCLUSIONS
Gabriel Falcão, [email protected]
University of Coimbra
Technical University of Lisbon
Portugal