ieee transactions on circuits and systems—i:...

8
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 62, NO. 1, JANUARY 2015 177 One Minimum Only Trellis Decoder for Non-Binary Low-Density Parity-Check Codes Jesús O. Lacruz, Francisco García-Herrero, Javier Valls, Member, IEEE, and David Declercq, Senior Member, IEEE Abstract—A one minimum only decoder for Trellis-EMS (OMO T-EMS) and for Trellis-Min-max (OMO T-MM) is proposed in this paper. In this novel approach, we avoid computing the second minimum in messages of the check node processor, and propose efcient estimators to infer the second minimum value. By doing so, we greatly reduce the complexity and at the same time im- prove latency and throughput of the derived architectures com- pared to the existing implementations of EMS and Min-max de- coders. This solution has been applied to various NB-LDPC codes constructed over different Galois elds and with different degree distributions showing in all cases negligible performance loss com- pared to the ideal EMS and Min-max algorithms. In addition, two complete decoders for OMO T-EMS and OMO T-MM were imple- mented for the (837,726) NB-LDPC code over GF(32) for compar- ison proposals. A 90 nm CMOS process was applied, achieving a throughput of 711 Mbps and 818 Mbps respectively at a clock fre- quency of 250 MHz, with an area of 19.02 and 16.10 after place and route. To the best knowledge of the authors, the pro- posed decoders have higher throughput and area-time efciency than any other solution for high-rate NB-LDPC codes with high Galois eld order. Index Terms—Check node processing, low-latency, NB-LDPC, OMO T-EMS, OMO T-MM, VLSI design. I. INTRODUCTION S INCE THE FIRST non-binary low-density parity-check (NB-LDPC) decoder architecture was proposed for the Q-ary Sum-of-Product algorithm (QSPA) [1], hardware de- signers have been working to derive solutions that allow the use of NB-LDPC codes in a wide range of communication and storage systems. Good error correction, high throughput and small area remain the challenge of any NB-LDPC decoder designer. Manuscript received May 20, 2014; revised July 19, 2014; accepted Au- gust 18, 2014. Date of publication October 08, 2014; date of current version January 06, 2015. This work was supported in part by the Spanish Ministerio de Ciencia e Innovación under Grant TEC2011-27916 and in part by the Uni- versitat Politècnica de València under Grant PAID-06-2012-SP20120625. The work of F. García-Herrero was supported by the Spanish Ministerio de Edu- cación under Grant AP2010-5178. David Declercq has been funded by the In- stitut Universitaire de France for this project. This paper was recommended by Associate Editor Z. Zhang. J. O. Lacruz is with the Electrical Engineering Department, Universidad de Los Andes, Mérida, 5101, Venezuela (e-mail: [email protected]). F. García-Herrero, and J. Valls are with the Instituto de Telecomunicaciones y Aplicaciones Multimedia, at Universitat Politècnica de València, 46730 Gandia, Spain (e-mail: [email protected]; [email protected]). D. Declercq is with the ETIS Laboratory, ENSEA/Univ. Cergy-Pon- toise/CNRS-UMR-8051, F-95000, Cergy-Pontoise, France (e-mail: david.de- [email protected]). Color versions of one or more of the gures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identier 10.1109/TCSI.2014.2354753 Extended Min-Sum (EMS) [2] and Min-Max (MM) [3] al- gorithms were proposed, with the aim of reducing the com- plexity involved in the check node processor, which is the bot- tleneck of QSPA algorithm. Although the decoding process is simplied by means of using forward-backward for the extrac- tion of check-to-variable messages, these metrics penalize the maximum throughput achievable when they are implemented in hardware. To avoid the use of forward-backward, in [4] the Trellis Extended Min-Sum (T-EMS) was proposed. With T-EMS, the degree of parallelism is increased using only combinations of the most reliable Galois eld (GF) symbols to compute the check-to-variable messages. The decoder presented in [4] was outperformed in [5] where an extra column is added to the original trellis with the purpose of generating the check-to-vari- able messages in a parallel way. The main drawback of the approach presented in [5] is that requires a lot of area and pipeline stages, reducing the overall efciency of the decoder. In [6] the hardware implementation of a T-EMS decoder is described, reaching the highest throughput found in literature. Previous trellis-based proposals, such as the ones from [7]–[9], applied partial-parallel decoding as a way to obtain the output messages in the check node processor. In [10] a decoder architecture named Relaxed Min-Max (RMM) is proposed. RMM makes an approximation for the second minimum calculation and hence, generates the check-to-variable messages with less complexity. The main drawbacks for this approach are: i) the check node output messages are derived serially, reducing the overall throughput of the decoder and increasing latency; and ii) the proposed approach suffers of an early degradation in the error oor region, due to the way of deriving the second minimum. In this paper, we introduce a novel second minimum approx- imation based on the statistical analysis of the check node mes- sages named as One Minimum Only (OMO) decoder. The mo- tivation to perform this approximation is that the two-minimum nder duplicates the critical path and increases the complexity of the check node processor. In addition to the second minimum estimator proposed in [10], we analyze two other estimators: one based on a slight modication of the one-minimum nder, and a last one which linearly combines the rst two estima- tors, and showed the best performance in simulations. The pro- posed OMO decoder can be applied to both T-EMS and Trellis Min-max obtaining OMO T-EMS and OMO T-MM decoders respectively. By avoiding the use of two-minimum nders [11] in the check node, we were able to reduce both area and latency 1549-8328 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Upload: others

Post on 28-Sep-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest...Index Terms—Check node processing, low-latency, NB-LDPC, OMO T-EMS, OMO T-MM,

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 62, NO. 1, JANUARY 2015 177

One Minimum Only Trellis Decoder for Non-BinaryLow-Density Parity-Check Codes

Jesús O. Lacruz, Francisco García-Herrero, Javier Valls, Member, IEEE, and David Declercq, Senior Member, IEEE

Abstract—A one minimum only decoder for Trellis-EMS (OMOT-EMS) and for Trellis-Min-max (OMO T-MM) is proposed inthis paper. In this novel approach, we avoid computing the secondminimum in messages of the check node processor, and proposeefficient estimators to infer the second minimum value. By doingso, we greatly reduce the complexity and at the same time im-prove latency and throughput of the derived architectures com-pared to the existing implementations of EMS and Min-max de-coders. This solution has been applied to various NB-LDPC codesconstructed over different Galois fields and with different degreedistributions showing in all cases negligible performance loss com-pared to the ideal EMS and Min-max algorithms. In addition, twocomplete decoders for OMOT-EMS andOMOT-MMwere imple-mented for the (837,726) NB-LDPC code over GF(32) for compar-ison proposals. A 90 nm CMOS process was applied, achieving athroughput of 711 Mbps and 818 Mbps respectively at a clock fre-quency of 250 MHz, with an area of 19.02 and 16.10after place and route. To the best knowledge of the authors, the pro-posed decoders have higher throughput and area-time efficiencythan any other solution for high-rate NB-LDPC codes with highGalois field order.

Index Terms—Check node processing, low-latency, NB-LDPC,OMO T-EMS, OMO T-MM, VLSI design.

I. INTRODUCTION

S INCE THE FIRST non-binary low-density parity-check(NB-LDPC) decoder architecture was proposed for the

Q-ary Sum-of-Product algorithm (QSPA) [1], hardware de-signers have been working to derive solutions that allow theuse of NB-LDPC codes in a wide range of communicationand storage systems. Good error correction, high throughputand small area remain the challenge of any NB-LDPC decoderdesigner.

Manuscript received May 20, 2014; revised July 19, 2014; accepted Au-gust 18, 2014. Date of publication October 08, 2014; date of current versionJanuary 06, 2015. This work was supported in part by the Spanish Ministeriode Ciencia e Innovación under Grant TEC2011-27916 and in part by the Uni-versitat Politècnica de València under Grant PAID-06-2012-SP20120625. Thework of F. García-Herrero was supported by the Spanish Ministerio de Edu-cación under Grant AP2010-5178. David Declercq has been funded by the In-stitut Universitaire de France for this project. This paper was recommended byAssociate Editor Z. Zhang.J. O. Lacruz is with the Electrical Engineering Department, Universidad de

Los Andes, Mérida, 5101, Venezuela (e-mail: [email protected]).F. García-Herrero, and J. Valls are with the Instituto de Telecomunicaciones y

AplicacionesMultimedia, at Universitat Politècnica de València, 46730 Gandia,Spain (e-mail: [email protected]; [email protected]).D. Declercq is with the ETIS Laboratory, ENSEA/Univ. Cergy-Pon-

toise/CNRS-UMR-8051, F-95000, Cergy-Pontoise, France (e-mail: [email protected]).Color versions of one or more of the figures in this paper are available online

at http://ieeexplore.ieee.org.Digital Object Identifier 10.1109/TCSI.2014.2354753

Extended Min-Sum (EMS) [2] and Min-Max (MM) [3] al-gorithms were proposed, with the aim of reducing the com-plexity involved in the check node processor, which is the bot-tleneck of QSPA algorithm. Although the decoding process issimplified by means of using forward-backward for the extrac-tion of check-to-variable messages, these metrics penalize themaximum throughput achievable when they are implementedin hardware.To avoid the use of forward-backward, in [4] the Trellis

Extended Min-Sum (T-EMS) was proposed. With T-EMS, thedegree of parallelism is increased using only combinations ofthe most reliable Galois field (GF) symbols to compute thecheck-to-variable messages. The decoder presented in [4] wasoutperformed in [5] where an extra column is added to theoriginal trellis with the purpose of generating the check-to-vari-able messages in a parallel way. The main drawback of theapproach presented in [5] is that requires a lot of area andpipeline stages, reducing the overall efficiency of the decoder.In [6] the hardware implementation of a T-EMS decoder isdescribed, reaching the highest throughput found in literature.Previous trellis-based proposals, such as the ones from [7]–[9],applied partial-parallel decoding as a way to obtain the outputmessages in the check node processor.In [10] a decoder architecture named Relaxed Min-Max

(RMM) is proposed. RMM makes an approximation forthe second minimum calculation and hence, generates thecheck-to-variable messages with less complexity. The maindrawbacks for this approach are: i) the check node outputmessages are derived serially, reducing the overall throughputof the decoder and increasing latency; and ii) the proposedapproach suffers of an early degradation in the error floorregion, due to the way of deriving the second minimum.In this paper, we introduce a novel second minimum approx-

imation based on the statistical analysis of the check node mes-sages named as One Minimum Only (OMO) decoder. The mo-tivation to perform this approximation is that the two-minimumfinder duplicates the critical path and increases the complexityof the check node processor. In addition to the second minimumestimator proposed in [10], we analyze two other estimators:one based on a slight modification of the one-minimum finder,and a last one which linearly combines the first two estima-tors, and showed the best performance in simulations. The pro-posed OMO decoder can be applied to both T-EMS and TrellisMin-max obtaining OMO T-EMS and OMO T-MM decodersrespectively. By avoiding the use of two-minimum finders [11]in the check node, we were able to reduce both area and latency

1549-8328 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Page 2: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest...Index Terms—Check node processing, low-latency, NB-LDPC, OMO T-EMS, OMO T-MM,

178 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 62, NO. 1, JANUARY 2015

of the check node update without introducing any performanceloss compared to the original EMS or Min-max algorithms.The OMO T-EMS and OMO T-MM check node architec-

tures have been implemented and included in a layered sched-uling decoder. A 90 nm CMOS process has been employedand a (837,726) NB-LDPC code over GF(32) has been chosento show the efficiency of our approach for high order fieldsand high rate codes. The OMO T-EMS and OMO T-MM de-coders achieve 100% and 159% higher efficiency (Mbps/mil-lion gates) compared to the most efficient decoder found inliterature [10] respectively, with about 30% less latency and40% higher throughput than the solution from [6] depending onwhether EMS or Min-max version is implemented.The rest of the paper is organized as follows: in Section II we

introduce the nomenclature and the main concepts of T-EMSalgorithm. The proposed approach for the second minimum es-timation of T-EMS algorithm, OMO T-EMS, is presented inSection III, including and analysis of performance for differentNB-LDPC codes and showing that can be extended to Min-maxalgorithm without loss of generality. Section IV includes thehardware implementation of the proposed check node and theoverall decoder. Moreover, synthesis and post place and routeresults of the design and comparisons with other architecturesare also included. Finally, conclusions are outlined in Section V.

II. TRELLIS—EXTENDED MIN-SUM ALGORITHM

NB-LDPC codes are characterized by a sparse parity checkmatrix where each non-zero element belongs to Ga-lois field . We consider regular NB-LDPC codeswith constant row weight and column weight . Decodingalgorithms for NB-LDPC codes use iterative message exchangebetween two types of nodes called check nodes (CN) (M rowsof ) and variable nodes (VN) (N columns of ).Let be the set of VN (CN) connected to a CN

(VN) . Let and be the edge messagesfromVN to CN and from CN to VN for each symbolrespectively. denotes the channel information andthe a posteriori information.Let and be the

transmitted codeword and received symbol sequence re-spectively, with and is the error vectorintroduced by the communication channel. The log-likeli-hood ratio (LLR) for each received symbol is obtained as

ensuring that allvalues are non-negative where is the symbol associated tothe highest reliability.Trellis Extended Min-Sum (T-EMS) algorithm [4] presents a

way of implementing the original EMS algorithm [2], avoidingthe use of the forward-backward metrics and increasing the de-gree of parallelism of the CN. Algorithm 1 includes the originalT-EMS CN algorithm, where the first and fifth steps perform thetransformation of incoming messages from “normal” todelta domain and from delta domain to normal do-main for the CN outgoing messages respec-tively. For the transformation, syndrome of the CNmust be obtained (Step 2 of Algorithm 1) using the incomingtentative hard decision for each CN message.

Algorithm 1: T-EMS Algorithm

Input: ,

for do

1

end

2

3

for do

4

5

end

Output:

The extra column calculation is derived on step 3using the configuration sets originally proposed in [2], with theaim of building the output messages using only the most reli-able information. The configuration set is definedas the set of at most symbols that satisfy the parity equa-tion, deviating at most times from the combination (path) ofsymbols with the highest reliability. Considering only the casewhen and , the extra column is built withthe combination of the most reliable messages for each GF(q)symbol i.e. with the minimum value message, .Once the values are derived, the outgoing CN mes-

sages in delta domain , are generated in Algorithm 1using step 4 which provides all the values for extrinsic CN out-going messages. For the intrinsic values, andare used as it is explained in detail in [5].It is important to remark that the values are used

for both, and generation while isonly used to compute (in the case of and

). Additionally, the extraction of the position of the firstminimum is also required , since this informationis used to derive the path for each extra column value in thetrellis. However, the two minimum values must be processedusing a two-minimum finder before the extra column calcula-tion. This two-minimum finder increases the critical path for

due to the extraction.In next section we propose a novel approach to approximate

the second minimum, which at the same time that reduces thecritical path to get the first minimum, achieves an accurate es-timation of the second one without degrading the performanceof the original T-EMS and Min-max algorithms.

III. ONE MINIMUM ONLY TRELLIS DECODER

As shown in Section II, the two-minimum finder representsan important part of the CN architecture. On the other hand,the hardware architectures to implement the minimum finderprocessor [11] introduce the same delay for both and

, which is not optimal for EMS and Min-max algo-rithms.

Page 3: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest...Index Terms—Check node processing, low-latency, NB-LDPC, OMO T-EMS, OMO T-MM,

LACRUZ et al.: ONE MINIMUM ONLY TRELLIS DECODER FOR NON-BINARY LOW-DENSITY PARITY-CHECK CODES 179

Fig. 1. Histograms for the different estimators of . The value wasset to 1.125.

Fig. 2. Histograms showing the error distribution of different estimators of. The value was set to 1.125.

This observation is our principal motivation for creating anovel check node architecture which approximates the com-putation of the second minimum, reducing the delay for thefirst one and hence improving the latency and the throughputof the decoder as it can be seen in next sections. Our proposedapproach has been tested on multiple NB-LDPC codes withdifferent GF(q) and degree distribution, showing in all casesa negligible performance loss compared with the T-EMS andMin-max algorithms. In order to simplify the description of ourproposal, we will focus on T-EMS, however, this method can bedirectly derived to Min-max algorithm without any loss of gen-erality. However, we will provide performance and implemen-tation results of this new solution for both EMS and Min-maxdecoders.In the rest of the section, different estimators of the second

minimum values are considered, and a statistical analysis oftheir distribution compared to the true secondminimum ismade.The analysis is done for the (837,726) NB-LDPC code overGF(32), where is generated using the methods proposed in[12], with circulant sub-matrices of size . How-ever, other codes with different GF and degree distribution havebeen tested obtaining the same behavior.

A. Estimators for the Second Minimum Value

A first natural solution for the estimation of is to makeuse of a scaled version of the first minimum , described in(1):

(1)

Fig. 3. Second minimum estimation based on a radix-2 one-minimum finder.Example for an eight inputs tree.

This approximation has been already proposed in [10]. How-ever, by just applying (1) the value of the minimum is usuallyunderestimated if we apply a value that mimics as much aspossible the behavior of EMS or Min-max in the waterfall re-gion.1 As it can be seen in Fig. 1, where we draw the distri-butions of the true and their proposed estimators, thevalue of is on average smaller than the real ,which leads to an important performance degradation in theerror floor region.A second possible estimator makes advantage of a re-use of

the hardware architecture. Using a radix-2 one-minimum finderis possible to determine an early estimation for the second min-imum. In Fig. 3, a one-minimum tree finder is presented. In thefigure, we include an extra multiplexor in the last stage, that al-lows extracting the looser term, denoted . By doingso and just using an extra multiplexor, this term can be used asan early estimator of the second minimum, which represents anupper-bound on the true minimum value. If the truevalue is located in the other half part of the tree that( branches of the minimum tree finder not connected to

), then we obtain . In the othercases, . Hence, the resultant value cor-responds to an provable upper bound on the true . Asystematic overestimation of the second minimum value couldlead also to performance degradation of the complete decoder,and we propose to combine and in order toget an estimator with a better statistical behavior.As we will demonstrate with a statistical analysis in the next

section, both and are biased estimators, oneoverestimating the true second minimum, and the other one un-derestimating the true second minimum. We therefore proposein this paper to combine those two estimators, by using a linearcombination of the two preceding estimators, in the followingway:

(2)Compared to the real values, presents a

similar behavior in the histogram shown in Fig. 1 which impliesthat the proposed estimation has similar statistical behavior thanthe exact values.

1 is selected as themean value of the ratio between and .

Page 4: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest...Index Terms—Check node processing, low-latency, NB-LDPC, OMO T-EMS, OMO T-MM,

180 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 62, NO. 1, JANUARY 2015

TABLE ISTATISTICAL PROPERTIES OF THE DIFFERENT ESTIMATORS AFTER AND DECODING ITERATIONS

The operations involved to implement (2) can be performedafter and values are obtained (using thehardware structure in Fig. 3). Therefore, the second minimumestimation can be made at the same time that valuesare obtained, to finally calculate check-to-variable outputmessages.In the next section, we analyse the statistical behavior of each

of the three proposed estimators.

B. Statistical Analysis of the Different Estimators

In Fig. 2, we plot the distributions of the difference be-tween the proposed estimators, being definedfollowing one of the (1), (2), and the true minimum, i.e.

. We performed this analysis by com-puting for each iteration and for different values thedifference between the real second minimum at the checknode and each one of the estimators. The information for thisanalysis is computed based on all the check nodes of theparity check matrix.From the shape of the distributions, we can see that

seems to be biased and skewed tothe positive values of the difference, which means that not only

underestimates the true minimum, but also that thedifference is not symmetric around its bias. Of course, as itwas expected for , which represents aupper bound on the true second minimum, we get the oppositebehavior, as the distribution of is leftbiased and skewed. In order to better measure the performanceof each estimator, we have computed the first four cumulantsof the distributions , and reported theirvalues in Table I after 1 and 15 decoding iterations. The firstcumulant of the distribution is the mean, and measures the biasof the estimator, a value of zero indicating that the estimatoris unbiased. The second cumulant is the square-root of thevariance, which indicates the spread of the estimator aroundthe mean value. The third cumulant is the skewness, and is ameasure of the symmetry of the distributions. A zero skewnessindicates that an estimator does not favor positive or negativedifference with the true value . Finally, the fourthcumulant, called the kurtosis is a measure of the flatness of thetails of the distribution. A low value of the kurtosis indicatesthat very large outliers values of the difference with the trueminimum do not appear with high probability. The kurtosisvalue for a Gaussian distribution is equal to 3.As we can see from those tables, typically tends to

underestimate the true value, since both the mean andthe skewness are positive, while overestimates thetrue value, since both the mean and the skewness arenegative (which was expected as is actually an upperbound of ). As we can see, the third estimator that wepropose, namely , is a better estimation than the other

2, with respect to all statistics. First it is practically unbiased atthe first iteration, although a slight positive bias seems to appearat iteration 15. The skewness is almost zero, which tells us that,on average, the decoder will underestimate or overestimate the

with the same frequency. Finally, both the variance andthe kurtosis of are the minimum amongthe three estimators, which indicates that values very differentthan the true minimum will appear less often withthan with the other two estimators. With respect to those indi-cators, is a better estimator of than or

.We will confirm in the next section that alsoprovides the maximum gain in error correction performance forthe overall LDPC decoder.

C. Frame Error Rate Performance

To prove the correct behavior of the proposed OMO T-EMSand OMO T-MM algorithms, we performed Frame Error Rate(FER) simulations for NB-LDPC codes with different degreedistributions and Galois field values, from GF(4) to GF(32),assuming transmission over Binary Phase Shift Keying (BPSK)modulation and Additive White Gaussian Noise (AWGN)channel. In this subsection we only include the performance fortwo different codes, the (837,726) NB-LDPC code over GF(32)with and and the (2212,1896) NB-LDPCcode over GF(4) with and . For the rest of thecodes we obtained similar results. We compare the proposedOMO T-EMS and OMO T-MM approaches to T-EMS [6] andRelaxed Min-Max (RMM) algorithms [10].Fig. 4 shows the frame error rate (FER) simulation results

of the (837,726) NB-LDPC code. For this code, the proposedOMO T-EMS algorithm in its floating point version (fp)achieves the same performance as T-EMS algorithm withoutany performance loss. Both algorithms use 15 iterations (it)and a scaling factor . In addition, the OMO T-MMalgorithm has a coding gain of 0.12 dB compared to the RMMfrom [10]. Comparing the quantized version of OMO T-EMSalgorithm to RMM algorithm, OMO T-EMS algorithm with7 bits (7b) for the datapath and 9 iterations achieves the sameperformance as RMM [10] with 5 bits for the datapath and 15iterations, so the proposed approach requires less iterationsthan the method from [10] to achieve the same performance.For the quantized version of the OMO T-MM the performancewith 6 bits (6b) and 8 iterations achieves the same than theRMM decoder.On Fig. 5, we have plotted the performance of the T-EMS

decoder with 15 iterations, and for all the approximations ofthe second minimum discussed in this paper. The curve labeledT-EMS uses the exact computed value of . As we can see,the fact that and do not estimate accurately thesecond minimum has indeed an impact on the overall decoderperformance, and especially in the error floor region, where a

Page 5: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest...Index Terms—Check node processing, low-latency, NB-LDPC, OMO T-EMS, OMO T-MM,

LACRUZ et al.: ONE MINIMUM ONLY TRELLIS DECODER FOR NON-BINARY LOW-DENSITY PARITY-CHECK CODES 181

Fig. 4. FER performance for the (837,726) NB-LDPC code over GF(32) withAWGN channel. Layered schedule is applied to all algorithms. forTEMS and OMO T-EMS algorithms. for OMO T-EMS algorithm.

for OMO T-MM algorithm.

Fig. 5. FER performance for the (837,726) NB-LDPC code over GF(32) withAWGN channel for the estimators of the second minimum value.for OMO T-EMS algorithm.

strong early flattening appears for both approximations (espe-cially for ). On the other hand, our proposed approxima-tion has absolutely no performance loss compared to theT-EMS with the exact minimum computation, both in the wa-terfall and the error floor regime. It results that the complexitygains provided by the OMO-T-EMS comes at no performanceloss, at least for the codes that we simulated.In Fig. 6, simulations for the (2212,1896) NB-LDPC show

a negligible performance loss of 0.03 dB for acomparing T-EMS to OMO T-EMS. The same happens whenwe compare Min-max algorithm to OMO T-MM, just a negli-gible difference of 0.04 dB is introduced by the approximation.The values in all simulations are adjusted using the mean ofthe ratio as we said before.

IV. OMO T-EMS AND OMO T-MM HARDWAREARCHITECTURES

In this section the hardware architectures for the proposedOMO T-EMS and OMO T-MM are introduced. Since the maincontribution of this paper focuses in the CN processing, firstwe detail the implementation results for the OMO T-EMS andOMOT-MMCN architectures comparing them to other existingsolutions. Finally, we present the results for the complete de-coders with horizontal layered schedule.

Fig. 6. FER performance for the (2212,1896) NB-LDPC code overwith AWGN channel. Layered schedule is applied to all algorithms. forT-EMS and OMO T-EMS algorithms. for OMO T-EMS algorithm.

for MM and OMO T-MM algorithms. for OMO T-MMalgorithm.

Fig. 7. Check node top architecture for T-EMS algorithm (a). Proposed OMOT-EMS/OMO T-MM check node architecture (b).

A. Check Node Architecture

In Fig. 7(a) the original T-EMS hardware structure [6] is in-cluded, while the proposed OMO T-EMS CN structure is pre-sented in Fig. 7(b). It can be observed that the main advantageof our approach is to avoid the use of two-minimum finders andapply one-minimum finders, reducing the total complexity andthe delay for the values, introducing the novel second-minimum estimation. To do this approximation, the block la-beled “ Processor” is responsible to implement the (2).The Processor does not introduce any additional delaysince the processing is made at the same time that thevalues are computed. Moreover, it is important to remark thatthe OMO decoding technique can be directly implemented fora Min-max decoder obtaining the same advantages.For both OMOT-EMS andOMOT-MMalgorithms, the row-

wise search of the most reliable messages implies thatthe one-minimum finder must have inputs and includes anextra multiplexor in the last stage to extract the valuesas shown in Fig. 3. For the CN one-minimum finders arerequired, each one formed by -bit comparators and

2-input multiplexors, where is the number of bitsfor the datapath. On the other hand, compared with the two-min-imum finders [11], the critical path is reduced by half due to thereduction of the hardware spent on the second minimum com-putation, which will impact greatly on the obtained throughput.

Page 6: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest...Index Terms—Check node processing, low-latency, NB-LDPC, OMO T-EMS, OMO T-MM,

182 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 62, NO. 1, JANUARY 2015

TABLE IICN COMPLEXITY COMPARISONS. FOR THE (837,726) NB-LDPC CODE OVER

GF(32)

Tomake a fair comparison with the two-minimum finder usedin conventional designs, we must add extra resources to im-plement (2), which reduce to -bit adders forthe one-minimum finders. This value is calculated consideringthat the implementation of (1) and (2) need only two additionaladders.On the other hand, a conventional two-minimum finder [11]

requires -bit comparators and 2-input mul-tiplexors. Considering the same number of equivalent gates foran adder and a comparator ( bits both of them), the two-min-imum finder has three times more multiplexors and two timesmore comparators than the one minimum finder plus the secondminimum estimation implementing (2).For the and transformation, the approach

used is similar to the one proposed in [13], requiring2-input multiplexors to perform both transforma-

tions. The check node’s syndrome is calculated adding alltentative hard decision symbols by means of a GF(q) adderin a tree structure fashion needing XOR gates.The extra column values are generated using a configuration

processor similar to the one proposed in [6] using-bit adders or comparators to generate the tentative

extra column values depending on whether EMS or Min-maxcheck node is implemented. To select the most reliable value,

one minimum finders are required, each one formed by-bit comparators and 2-input multiplexors.

To compute the path info,2-input multiplexors, XOR gates and

OR gates are implemented.The resources required for the CN implementation of OMO

T-EMS and OMO T-MM are summarized in Table II and com-pared with the approaches from [10] and [6]. VHDL was usedfor the description of the hardware and the total gate accountwas derived after synthesis using Cadence RTL Compiler. Thehardware implementation was performed for the (837,726)NB-LDPC code over , with and .As it can be observed, although the CN in [10] has less NAND

gates than our proposals, their CN requires to store interme-diate messages due to the serial processing, increasing the gateaccount of the CN to 230714 NAND gates (considering thatstoring one bit of RAM memory is equivalent in terms of areato 1.5 NAND gates [10], [14], [15]). Hence, our proposals re-quires at most 18% less logical resources than the CN presentedin [10], even considering that we use two extra bits.For T-EMS decoder presented in [6] we did not provide sep-

arate results for the CN architecture, however we obtained theseresults considering the main differences with our new proposal.The CN from [6] needs about four times more hardware than ourOMO approaches for the extra column values computation due

TABLE IIICOMPARISON OF THE PROPOSED NB-LDPC LAYERED DECODERS TO OTHERWORKS FROM LITERATURE. FOR THE (837,726) NB-LDPC CODE OVER

GF(32)

to the use of the first and second minimum for the extraction ofthe extra column values. As we can see in Table II OMOT-EMSand OMO T-MM require 37% less logical resources than [6].

B. Complete Decoder Architecture

The proposed CN architecture has been included in an hori-zontal layered schedule decoder with one CN cell (Fig. 7) andVN units. Each VN processor includes dual-port memories thatstore the LLR values and avoid adding extra latency.On the other hand, due to the layered schedule, a shift registeris required to store the “last iteration” CN output information

.Since, only one CN cell is implemented in the decoder,

clock cycles are required to complete one decoding iteration.This value is increased due to the pipeline stages intro-duced in the decoder ( clock cycles are added) with theaim of achieving the desired clock frequency . As afterprocessing one entire circulant sub-matrix the pipeline regis-ters must be empty before processing a new one, reducing thelogical path of the decoder has a great impact in the maximumthroughput achieved by the decoder (3). Finally, additionalclock cycles are required to load the LLR values and output theestimated codeword of the decoder.

(3)

With OMO T-EMS we reduce the critical path of the CN, sowe only require 8 pipeline stages to achieve a clock frequencyof after place and route with Cadence SOCencounter tools and employing a 90 nm CMOS library in whichthe area of a NAND gate is 3.13 . The total area of thedecoder is 19.02 with a core occupation of 60% and a gateaccount of million of NAND gates.For OMO T-MM the number of pipeline stages is 8 and the

maximum clock frequency is . The total areais 16.10 with a core occupation of 70% and a gate accountof million NAND gates.

Page 7: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest...Index Terms—Check node processing, low-latency, NB-LDPC, OMO T-EMS, OMO T-MM,

LACRUZ et al.: ONE MINIMUM ONLY TRELLIS DECODER FOR NON-BINARY LOW-DENSITY PARITY-CHECK CODES 183

It is important to remark that the library used to implementboth OMO T-EMS and OMO T-MM do not include optimizedRAM memories, so each bit of RAM is implemented as a reg-ister, and hence, the area for the memories is about three timeslarger. Due to this, the total number of equivalent NAND gatesis overestimated compared to the results found in literature thatalways assume optimized memories. For this reason we includein Table III, for comparison purposes, the equivalent number ofNAND gates assuming that each bit of RAM is implementedwith and area of 1.5 NAND gates.To achieve the same performance as in [10] and [6], our OMO

T-EMS approach requires only 9 decoding iterations, as can beseen in Fig. 4, therefore the total latency of the decoder is 1435clock cycles, which corresponds to a throughput of 729 Mbps(3). For the OMO T-MM 1279 clock cycles are required to getthe same FER performance as RMM or T-EMS, so a maximumthroughput of 818 Mbps is reached.OMO T-EMS and OMO T-MM decoders have been com-

pared to the most efficient NB-LDPC decoder designs to thebest knowledge of the authors. The results of the comparisonshave been included in Table III, where we have scaled the re-sults in [10] to include all throughput results over 90 nm CMOSprocess [16].The throughput of both OMO T-EMS and OMO T-MM de-

coders is higher than any decoder proposed in literature for highorder fields and high rate NB-LDPC codes (see Table III), be-cause of the improvements made at the check node processor.Our approaches require in the worst case (OMO T-EMS) less

than half area than [15] and achieve at least 3.2 times higherthroughput, so our most complex solution is 13 times more ef-ficient.2

On the other hand, the decoder presented in [10] has beenconsidered since it was the most efficient one until now andit uses (1) as a method to approximate the second minimum,which gives benefits in terms of area but introduces early per-formance degradation (Fig. 4). Despite this, OMO T-EMS has8.8 times less latency than [10] achieving 4.75 times higherthroughput with a decoder 49.7% more efficient in terms of areaover throughput (for a 90 nm CMOS process). On the otherhand, OMO TMM has 61.2% higher efficiency than [10] with9.9 times less latency and 5.3 times higher throughput.Finally, our OMO T-EMS approach has been compared

against the T-EMS decoder presented in [6]. Making use of thenovel approach for the second minimum estimation, the latencyis reduced on 33% with respect to [6] with an increment inthroughput of 50%. The area was also reduced in 25%, so theefficiency is 50% higher.It is important to remark that the proposed approach is

focused on high-rate NB-LDPC codes. However, efficientNB-LDPC decoders suitable for lower rate codes have beenproposed in the literature [17], [18]. These architectures makea parallel processing of messages.

V. CONCLUSIONS

In this paper a new method to estimate the second minimumvalue in message of the check node processor of NB-LDPC de-

2Note that [15] is the only proposal that also provides post place and routeresults.

coders is proposed. This solution avoids the use of two-min-imum finders, greatly reducing the check node complexity. Thesimplifications applied to the T-EMS and T-MM algorithmsreduce latency and area with respect to the original proposal,without introducing any significant performance loss. The pro-posed check node architecture was included in a complete de-coder with layered schedule achieving 729 Mbps of throughputafter place and route on a 90 nm CMOS process for OMOT-EMS and 818 Mbps for OMO T-MM. The designed decodernearly doubles the efficiency of the best solutions found in lit-erature for high order fields and high rate codes.

REFERENCES

[1] M. Davey and D. MacKay, “Low-density parity check codes overGF(q),” IEEE Commun. Lett., vol. 2, no. 6, pp. 165–167, 1998.

[2] D. Declercq and M. Fossorier, “Decoding algorithms for nonbinaryLDPC codes over GF(q),” IEEE Trans. Commun., vol. 55, no. 4, pp.633–643, 2007.

[3] V. Savin, “Min-max decoding for non binary LDPC codes,” in Proc.IEEE Int. Symp. Inf. Theory, 2008, pp. 960–964.

[4] E. Li, K. Gunnam, and D. Declercq, “Trellis based extended min-sumfor decoding nonbinary LDPC codes,” in Proc. Proc. 8th Int. Symp.Wireless Commun. Syst. (ISWCS), 2011, pp. 46–50.

[5] E. Li, D. Declercq, and K. Gunnam, “Trellis-based extended min-sumalgorithm for non-binary LDPC codes and its hardware structure,”IEEE Trans. Commun., vol. 61, no. 7, pp. 2600–2611, 2013.

[6] E. Li, D. Declercq, K. Gunnam, F. García-Herrero, J. Lacruz, and J.Valls, “Low latency T-EMS decoder for NB-LDPC codes,” in Conf.Rec. 47th Asilomar Conf. Signals, Syst., Comput. (ASILOMAR) 2013.

[7] X. Zhang and F. Cai, “Reduced-latency scheduling scheme formin-max non-binary LDPC decoding,” in Proc. IEEE Asia PacificConf. Circuits Syst. (APCCAS), Dec. 2010, pp. 414–417.

[8] X. Zhang and F. Cai, “Reduced-complexity decoder architecture fornon-binary LDPC codes,” IEEE Trans. Very Large Scale Integr. (VLSI)Syst., vol. 19, no. 7, pp. 1229–1238, Jul. 2011.

[9] M. Punekar andM. Flanagan, “Trellis-based check node processing forlow-complexity nonbinary LP decoding,” in Proc. IEEE Int. Symp. Inf.Theory (ISIT), Jul. 2011, pp. 1653–1657.

[10] F. Cai and X. Zhang, “Relaxed min-max decoder architectures for non-binary low-density parity-check codes,” IEEE Trans. Very Large ScaleIntegr. (VLSI) Systems, vol. 21, no. 11, pp. 2010–2023, Nov. 2013.

[11] C.-L. Wey, M.-D. Shieh, and S.-Y. Lin, “Algorithms of finding the firsttwominimum values and their hardware implementation,” IEEE Trans.Circuits Syst. I, Reg. Papers, vol. 55, no. 11, pp. 3430–3437, 2008.

[12] B. Zhou, J. Kang, S. Song, S. Lin, K. Abdel-Ghaffar, andM. Xu, “Con-struction of non-binary quasi-cyclic LDPC codes by arrays and arraydispersions,” IEEE Trans. Commun., vol. 57, no. 6, pp. 1652–1662,2009.

[13] J. Lin, J. Sha, Z. Wang, and L. Li, “Efficient decoder design for non-binary quasicyclic LDPC codes,” IEEE Trans. Circuits Syst. I, Reg.Papers, vol. 57, no. 5, pp. 1071–1082, 2010.

[14] X. Chen and C.-L.Wang, “High-throughput efficient non-binary LDPCdecoder based on the simplified min-sum algorithm,” IEEE Trans. Cir-cuits Syst. I, Reg. Papers, vol. 59, no. 11, pp. 2784–2794, Nov. 2012.

[15] Y.-L. Ueng, K.-H. Liao, H.-C. Chou, and C.-J. Yang, “A high-throughput trellis-based layered decoding architecture for non-binaryLDPC codes using max-log-QSPA,” IEEE Trans. Signal Process., vol.61, no. 11, pp. 2940–2951, 2013.

[16] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Cir-cuits: A Design Perspective, 2nd ed. Upper Saddle River, NJ, USA:Prentice-Hall, 2003.

[17] J. Lin and Z. Yan, “An efficient fully parallel decoder architecturefor nonbinary LDPC codes,” IEEE Trans. Very Large Scale Integr.(VLSI) Syst. 2013 [Online]. Available: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6687216

[18] Y. S. Park, Y. Tao, and Z. Zhang, “A 1.15 Gb/s fully parallel nonbinaryLDPC decoder with fine-grained dynamic clock gating,” in IEEE Int.Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), Feb. 2013, pp.422–423.

Page 8: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest...Index Terms—Check node processing, low-latency, NB-LDPC, OMO T-EMS, OMO T-MM,

184 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 62, NO. 1, JANUARY 2015

Jesús O. Lacruz received the B.S. degree in elec-trical engineering from the Universidad de LosAndes, Merida, Venezuela in 2009 and the M.S.degree in electrical engineering from the UniversitatPolitècnica de València, Valencia, Spain in 2013. Heis currently pursuing the Ph.D degree in electricalengineering from the Universitat Politècnica deValència. Since 2010, he has been an Assistant Pro-fessor with the Electrical Engineering Department,Universidad de Los Andes. His research interestsinclude the design of VLSI architectures for digital

communications, especially on error correction coding.

Francisco García-Herrero received the B.S. degreein telecommunication engineering from EscuelaPolitecnica Superior de Gandia, Spain, in 2008, andthe M.S. and Ph.D. degrees in electrical engineeringfrom Universitat Politècnica de València, Spain, in2010 and 2013, respectively. He is currently withthe Institute of Telecommunications and MultimediaApplications (iTEAM), Valencia. His research inter-ests include hardware and algorithmic optimizationof error-control decoders.

Javier Valls received the Telecommunication En-gineering degree from the Universidad Politecnicade Cataluna, Cataluna, Spain, and the Ph.D. degreein telecommunication engineering from the Univer-sidad Politecnica de Valencia, Valencia, Spain, in1993 and 1999, respectively. He has been with theDepartment of Electronics, Universidad Politecnicade Valencia since 1993, where he is currently anAssociate Professor. His current research interestsinclude the design of FPGA-based systems, com-puter arithmetic, VLSI signal processing, and digital

communications.

David Declercq (SM’11) was born in June 1971.He received his Ph.D. degree in statistical signalprocessing from the University of Cergy-Pontoise,France, in 1998. He is currently full professor atthe ENSEA in Cergy-Pontoise. He is the generalsecretary of the National GRETSI association. Heis currently the recipient of a junior position at theInstitut Universitaire de France. His research topicslie in digital communications and error-correctioncoding theory. He worked several years on theparticular family of LDPC codes, both from the

code and decoder design aspects. Since 2003, he developed a strong expertiseon non-binary LDPC codes and decoders in high order Galois fields GF(q).A large part of his research projects are related to non-binary LDPC codes.He mainly investigated two aspects: i) the design of GF(q) LDPC codes forshort and moderate lengths, and ii) the simplification of the iterative decodersfor GF(q) LDPC codes with complexity/performance tradeoff constraints. Hehas published more than 30 papers in major journals (IEEE TRANSACTIONSON COMMUNICATIONS, IEEE TRANSACTIONS ON INFORMATION THEORY,Communications Letters, EURASIP JWCN), and more than 100 papers inmajor conferences in information theory and signal processing.