
Page 1:

MASSIVELY PARALLEL LDPC DECODING ON GPU

Vivek Tulsidas Bhat

Priyank Gupta

Page 2:

“Workload Partitioning”

Priyank: Motivation and LDPC introduction. Analysis of the sequential algorithm and build-up to the parallelization strategy. Lessons Learned : Part 1.

Vivek: Parallelization strategy. Results and discussion. Lessons Learned : Part 2. Conclusion.

Page 3:

Motivation

FEC codes are used extensively in a variety of applications to ensure reliable communication.

Current application trends demand ever-increasing data rates.

To operate close to the Shannon limit, low-complexity encoders and decoders are necessary.

Enter LDPC: Low-Density Parity-Check codes.

Page 4:

LDPC : Quick Overview

Iterative decoding approach.

Inherently data-parallel.

Computationally expensive.

Therefore, a perfect candidate for parallelization.

Page 5:

Our Initial Approach

Page 6:

Parallel Code Flow

Likelihood Ratio Initialization → Probability Ratio Initialization → Likelihood Ratio Recomputation → Probability Ratio Recomputation → Next Guess Calculation → Found codeword or reached max iterations?
If yes, report results; if no, loop back to the likelihood-ratio recomputation step.
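As a rough host-side sketch of this flow (the kernel names, the max_iter bound, and the parity_checks_satisfied helper are illustrative placeholders, not the identifiers used in our code):

/* Illustrative host driver mirroring the flow chart above. */
void decode_block(double *d_lr, double *d_pr, char *d_guess,
                  int mn_bits, int max_iter)
{
    int threads = 256;
    int blocks  = (mn_bits + threads - 1) / threads;

    init_lr_kernel<<<blocks, threads>>>(d_lr, mn_bits);        /* Likelihood Ratio Initialization */
    init_pr_kernel<<<blocks, threads>>>(d_pr, d_lr, mn_bits);  /* Probability Ratio Initialization */

    for (int iter = 0; iter < max_iter; iter++)
    {
        recompute_lr_kernel<<<blocks, threads>>>(d_lr, d_pr, mn_bits);
        recompute_pr_kernel<<<blocks, threads>>>(d_pr, d_lr, mn_bits);
        next_guess_kernel<<<blocks, threads>>>(d_guess, d_pr, mn_bits);

        if (parity_checks_satisfied(d_guess, mn_bits))  /* found a codeword? */
            break;
    }
    /* Report results: copy d_guess back to the host. */
}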

Page 7:

Analysis of Sequential Code

Page 8:

Sparse Matrix Representation

typedef struct mod2entry        /* Structure representing a non-zero entry,
                                   or the header for a row or column */
{
  int row, col;                 /* Row and column indexes */

  struct mod2entry *left, *right,  /* Pointers to adjacent entries in row */
                   *up, *down;     /* and column, or to headers. Free     */
                                   /* entries are linked by 'left'.       */

  double pr, lr;                /* Probability and likelihood ratios - not used */
                                /* by the mod2sparse module itself */
} mod2entry;

typedef struct                  /* Representation of a sparse matrix */
{
  int n_rows;                   /* Number of rows in the matrix */
  int n_cols;                   /* Number of columns in the matrix */

  mod2entry *rows;              /* Ptr to array of row headers */
  mod2entry *cols;              /* Ptr to array of column headers */

  mod2block *blocks;            /* Allocated blocks */
  mod2entry *next_free;         /* Next free entry */
} mod2sparse;
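For concreteness, a minimal sketch of how this linked structure is walked; it assumes the convention of Radford Neal's LDPC software (the source of these structures) that the row header's 'right' pointer leads to the first entry and that header entries carry a negative row index.

#include <stdio.h>

/* Visit every non-zero entry in row i of H.
   Assumption: header entries are marked with row < 0. */
void walk_row(mod2sparse *H, int i)
{
    for (mod2entry *e = H->rows[i].right; e->row >= 0; e = e->right)
    {
        /* pr and lr hold the per-entry probability and likelihood ratios
           used by the decoder. */
        printf("entry (%d,%d): pr=%g lr=%g\n", e->row, e->col, e->pr, e->lr);
    }
}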

Page 9:

Likelihood Ratio Computation

LR_estimator = 1 (initial)

Forward transition:
  element_LR(n)     = LR_estimator(n)
  LR_estimator(n+1) = LR_estimator(n) * (2 / element_PR(n+1) - 1)

Reverse transition:
  temp              = element_LR(n) * LR_estimator(n)
  element_LR(n-1)   = (1 - temp) / (1 + temp)
  LR_estimator(n-1) = LR_estimator(n) * (2 / element_PR(n-1) - 1)

Example parity-check matrix H (each row is processed by one forward and one reverse pass):
  1 0 0 1 1 1 0
  0 1 0 1 1 0 1
  0 0 1 0 1 1 1
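Read as code, the two passes amount to the following minimal sequential sketch over one parity-check row; element_lr, element_pr, and n_entries are illustrative names, and the bookkeeping uses a single running index n in both passes rather than the slide's (n+1)/(n-1) notation.

/* Likelihood-ratio update for one parity-check row.
   element_pr[] holds the incoming probability ratios of the row's entries;
   element_lr[] receives the outgoing likelihood ratios. */
void row_lr_update(double *element_lr, const double *element_pr, int n_entries)
{
    double est = 1.0;                          /* LR_estimator, forward pass */
    for (int n = 0; n < n_entries; n++)
    {
        element_lr[n] = est;                   /* product of terms to the left */
        est *= 2.0 / element_pr[n] - 1.0;
    }

    est = 1.0;                                 /* LR_estimator, reverse pass */
    for (int n = n_entries - 1; n >= 0; n--)
    {
        double temp = element_lr[n] * est;     /* combine left and right products */
        element_lr[n] = (1.0 - temp) / (1.0 + temp);
        est *= 2.0 / element_pr[n] - 1.0;
    }
}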

Page 10:

Probability Ratio Computation

Example parity-check matrix H (each column is processed by one top-down and one bottom-up pass):
  1 0 0 1 1 1 0
  0 1 0 1 1 0 1
  0 0 1 0 1 1 1

PR_estimator(n) = Likelihood_Ratio(n) (initial; the bit's likelihood ratio from the channel)

Top-down transition:
  element_PR(n)     = PR_estimator(n)
  PR_estimator(n+1) = PR_estimator(n) * element_LR(n)

Bottom-up transition:
  element_PR(n-1)   = element_PR(n) * PR_estimator(n)
  PR_estimator(n-1) = PR_estimator(n) * element_LR(n)
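The column pass can be sketched analogously; channel_lr, element_lr, and element_pr are again illustrative names, and the bottom-up pass restarts its estimator at 1, a detail the slide leaves implicit.

/* Probability-ratio update for one column (one codeword bit).
   channel_lr is the bit's likelihood ratio from the channel;
   element_lr[] holds the values produced by the row pass. */
void col_pr_update(double *element_pr, const double *element_lr,
                   double channel_lr, int n_entries)
{
    double est = channel_lr;                   /* PR_estimator, top-down pass */
    for (int n = 0; n < n_entries; n++)
    {
        element_pr[n] = est;                   /* product of terms from above */
        est *= element_lr[n];
    }

    est = 1.0;                                 /* PR_estimator, bottom-up pass */
    for (int n = n_entries - 1; n >= 0; n--)
    {
        element_pr[n] *= est;                  /* fold in the product from below */
        est *= element_lr[n];
    }
}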

Page 11:

Lessons Learned : Part 1

"entities must not be multiplied beyond necessity"

Page 12:

Parallelization Strategy

Page 13:

Transformation

Pipeline applied to codeword i:
  Likelihood Ratio Computation → Probability Ratio Recomputation → Next Guess Calculation → Found codeword or reached max iterations?
  If no, iterate again; if yes, report results.

The same pipeline runs in parallel for neighboring codewords: ... Codeword i-2, Codeword i-1, Codeword i+1, Codeword i+2 ...

Page 14:

Use 1-D arrays

Data buffers, each stored as a flat 1-D array:

BSC channel data (N M-bit codewords read at a time)

BSC data array with the N codewords aligned

Likelihood ratios for all MN bits

Bit probabilities for all MN bits

Decoded blocks (N M-bit codewords)

Each thread does the computation for one bit, so for N M-bit codewords we need MN threads for the likelihood-ratio, probability-ratio, and decoded-block computations (see the sketch below).
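A minimal sketch of the one-thread-per-bit mapping over these flat arrays, using the likelihood-ratio initialization for a BSC as the example; the kernel and array names are assumptions, not the project's actual identifiers.

/* One thread per bit, MN = num_codewords * bits_per_codeword threads total.
   Each thread fills in the likelihood ratio of its bit from the received
   BSC value, given crossover probability p. */
__global__ void init_lr_per_bit(double *lr, const char *rx_bits,
                                int mn_bits, double p)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   /* global bit index */
    if (idx < mn_bits)
    {
        /* BSC likelihood ratio: (1-p)/p if a 1 was received, p/(1-p) if a 0. */
        lr[idx] = rx_bits[idx] ? (1.0 - p) / p : p / (1.0 - p);
    }
}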

Page 15:

Likelihood Ratio Computation : Revisited

Example parity-check matrix H (one codeword):
  1 0 0 1 1 1 0
  0 1 0 1 1 0 1
  0 0 1 0 1 1 1

The likelihood-ratio estimators for the forward and reverse passes are calculated on the host before the likelihood-ratio kernel is launched (a sketch follows below).

Note: the illustration is for just one codeword; this is done for N codewords at a time.

(Figure: Likelihood Ratio Estimator, forward and reverse estimation across a row.)
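One way this host-side precomputation could look: running products of (2/PR - 1) are built in both directions for each row and stored in flat arrays so the kernel can combine them per entry. fwd_est, rev_est, and row_pr are hypothetical names, not the project's actual data structures.

/* Host-side sketch: forward/reverse LR estimators for one parity-check row. */
void precompute_row_estimators(const double *row_pr, double *fwd_est,
                               double *rev_est, int n_entries)
{
    fwd_est[0] = 1.0;                                    /* forward estimation */
    for (int n = 1; n < n_entries; n++)
        fwd_est[n] = fwd_est[n - 1] * (2.0 / row_pr[n - 1] - 1.0);

    rev_est[n_entries - 1] = 1.0;                        /* reverse estimation */
    for (int n = n_entries - 2; n >= 0; n--)
        rev_est[n] = rev_est[n + 1] * (2.0 / row_pr[n + 1] - 1.0);

    /* The kernel can then form, per entry:
       t = fwd_est[n] * rev_est[n];  lr[n] = (1 - t) / (1 + t); */
}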

Page 16:

Probability Ratio Computation : Revisited

Example parity-check matrix H (one codeword):
  1 0 0 1 1 1 0
  0 1 0 1 1 0 1
  0 0 1 0 1 1 1

Likewise for the probability-ratio computation, only this time the operations are performed on a column basis.

(Figure: Probability Ratio Estimator, top-down and bottom-up transitions across a column.)

Page 17:

Salient features of our implementation:

Usage of an efficient sparse-matrix representation of the standard parity-check matrix.

Simple mathematical model for the likelihood-ratio and probability-ratio computations.

Dedicated data structures for the likelihood-ratio and probability-ratio kernels.

Code is easily customizable for different code rates.

Supports a larger number of codewords without any major change to the program architecture.

Page 18:

Experimental Setup

                             CPU                GPU1                     GPU2
Platform                     Intel Core 2 Duo   NVIDIA GeForce 8400 GS   NVIDIA GeForce GT 120
Clock speed (memory clock)   2.6 GHz            900 MHz                  500 MHz
Memory                       4 GB               512 MB                   512 MB
CUDA toolkit version         N/A                2.3                      2.2
Programming environment      Linux              Visual Studio            Linux

Page 19:

Results (1/3)

Tested extensively for a code rate of (3,7) on a BSC channel with error probability 0.05.

Optimal execution configuration: numThreadsPerBlock = 256, numBlocks = 7 * mul_factor, where mul_factor depends on the number of codewords to be decoded (see the sketch below):

mul_factor = num_codewords / numThreadsPerBlock
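A sketch of how this configuration might be derived and used for a launch; decode_kernel and its arguments are illustrative placeholders, and the factor of 7 corresponds to the bits per codeword of the (3,7) example.

/* Launch-configuration fragment for decoding num_codewords codewords.
   Assumes num_codewords is a multiple of numThreadsPerBlock. */
int numThreadsPerBlock = 256;
int mul_factor = num_codewords / numThreadsPerBlock;
int numBlocks  = 7 * mul_factor;      /* 7 bits per codeword -> MN threads in total */

decode_kernel<<<numBlocks, numThreadsPerBlock>>>(d_lr, d_pr, d_guess, num_codewords);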

Bit error rate is evaluated as the percentage of bits that differ from the original source file.

Page 20:

Results (2/3) : Software Execution Time

Page 21:

Results (3/3) : Bit Error Rate Curve

Page 22:

Lessons Learned : Part 2

High occupancy does not guarantee better performance.

Although the GPU implementation provides considerable speedup, its BER results are not attractive (in fact, worse than the CPU-based implementation).

The absence of a double-precision floating-point unit in the GPU impacted the results: the probability-ratio and likelihood-ratio computations are based on double-precision arithmetic.

Reliability? Random bit flips? These could be catastrophic depending on the application for which LDPC decoding is used.

Other programming paradigms: OpenMP? Not as attractive in terms of speedup compared to the GPU, but a better BER curve.

A case for built-in ECC features within the GPU architecture: the NVIDIA Fermi architecture!

Page 23:

Future Work

Trying this on an AWGN channel for different error probabilities.

How does this perform on better GPU architectures? Tesla? Fermi?

Any other parallelization strategies? CuBLAS routines for sparse-matrix computations on the GPU?

Page 24:

Acknowledgement

We would like to thank Prof. Ali Akoglu and Murat Arabaci (OCSL Lab) for guiding us throughout the course of this project.

Page 25:

References

Gabriel Falcao, Leonel Sousa, Vitor Silva, “How GPUs Can Outperform ASICs for Fast LDPC Decoding”, ICS ’09.

Gabriel Falcao, Leonel Sousa, Vitor Silva, “Parallel LDPC Decoding on the Cell/B.E. Processor”, HiPEAC 2009.

Gregory M. Striemer, Ali Akoglu, “An Adaptive LDPC Engine for Space Based Communication Systems”.

Page 26:

Questions : Ask!

Page 27:

Backup Slides

Page 28:

Code Transformation: Likelihood ratio Init Kernel

Page 29:

Code Transformation: Initprp Decode Kernel

Page 30:

Code Transformation: Likelihood Ratio Kernel

Page 31:

Code Transformation: Probability Ratio Kernel

Page 32:

Code Transformation: Next Guess Kernel