
J Sign Process Syst
DOI 10.1007/s11265-011-0617-7

Implementation of a High Throughput 3GPP Turbo Decoder on GPU

Michael Wu · Yang Sun · Guohui Wang · Joseph R. Cavallaro

Received: 27 January 2011 / Revised: 1 June 2011 / Accepted: 7 August 2011
© Springer Science+Business Media, LLC 2011

Abstract Turbo code is a computationally intensive channel code that is widely used in current and upcoming wireless standards. The general-purpose graphics processing unit (GPGPU) is a programmable commodity processor that achieves high computation power by using many simple cores. In this paper, we present a 3GPP LTE compliant Turbo decoder accelerator that takes advantage of the processing power of the GPU to offer fast Turbo decoding throughput. Several techniques are used to improve the performance of the decoder. To fully utilize the computational resources on the GPU, our decoder can decode multiple codewords simultaneously, divide the workload for a single codeword across multiple cores, and pack multiple codewords to fit the single instruction multiple data (SIMD) instruction width. In addition, we use shared memory judiciously to enable hundreds of concurrent threads while keeping frequently used data local to keep memory access fast. To improve the efficiency of the decoder in the high SNR regime, we also present a low complexity early termination scheme based on average extrinsic LLR statistics. Finally, we examine how different workload partitioning choices affect the error correction performance and the decoder throughput.

Keywords GPGPU · Turbo decoding · Accelerator · Parallel computing · Wireless · Error control codes · Turbo codes

M. Wu (B) · Y. Sun · G. Wang · J. R. Cavallaro
Rice University, Houston, TX, USA
e-mail: [email protected]

1 Introduction

Turbo code [1] has become one of the most important research topics in coding theory since its discovery in 1993. As a practical code that can offer near channel capacity performance, Turbo codes are widely used in many 3G and 4G wireless standards such as CDMA2000, WCDMA/UMTS, IEEE 802.16e WiMax, and 3GPP LTE (Long Term Evolution). The inherently large decoding latency and the complex iterative decoding algorithm make Turbo codes very difficult to implement on a general purpose CPU or DSP. As a result, Turbo decoders are typically implemented in hardware [2–8]. Although ASIC and FPGA designs are more power efficient and can offer extremely high throughput, there are a number of applications and research fields, such as cognitive radio and software based wireless testbed platforms such as WARPLAB [9], which require support for multiple standards. As a result, we want an alternative to dedicated silicon that supports a variety of standards and yet delivers good throughput performance.

GPGPU is an alternative to dedicated silicon which is flexible and can offer high throughput. The GPU employs hundreds of cores to process data in parallel, which is well suited to a number of wireless communication algorithms. For example, many computationally intensive blocks such as channel estimation, MIMO detection, channel decoding and digital filters can be implemented on GPU. The authors in [10] implemented a complete 2 × 2 WiMAX MIMO receiver on GPU. In addition, there are a number of recent papers on MIMO detection [11, 12]. There are also a number of GPU based LDPC channel decoders [13]. Despite the popularity of Turbo codes, there are few existing Turbo decoder implementations on GPU [14, 15]. Compared to LDPC decoding, implementing a Turbo decoder on GPU is more challenging, as the algorithm is fairly sequential and difficult to parallelize.

In our implementation, we attempt to increase computational resource utilization by decoding multiple codewords simultaneously, and by dividing a codeword into several sub-blocks to be processed in parallel. As the underlying hardware architecture is single instruction multiple data (SIMD), we pack multiple sub-blocks to fit the SIMD vector width. Finally, as an excessive use of shared memory decreases the number of threads that run concurrently on the device, we attempt to keep frequently used data local while reducing the overall shared memory usage. We include an early termination algorithm that evaluates the average extrinsic LLR from each decoding cycle to improve throughput performance in the high signal to noise ratio (SNR) regime. Finally, we provide both throughput and bit error rate (BER) performance of the decoder and show that we can parallelize the workload on GPU while maintaining reasonable BER performance.

The rest of the paper is organized as follows: In Sections 2 and 3, we give an overview of the CUDA architecture and the Turbo decoding algorithm. In Section 4, we discuss the implementation aspects on GPU. Finally, we present BER performance and throughput results and analyses in Section 5 and conclude in Section 6.

2 Compute Unified Device Architecture (CUDA)

A programmable GPU offers extremely high computation throughput by processing data in parallel using many simple stream processors (SPs) [16]. Nvidia's Fermi GPU offers up to 512 SPs grouped into multiple stream multi-processors (SMs). Each SM consists of 32 SPs and two independent dispatch units. Each dispatch unit on an SM can dispatch a 32-wide SIMD instruction, a warp instruction, to a group of 16 SPs. During execution, a group of 16 SPs processes the dispatched warp instruction in a data-parallel fashion. Input data is stored in a large amount of external device memory (>1 GB) connected to the GPU. As latency to device memory is high, there are fast on-chip resources to keep data on-die. The fastest on-chip resource is registers. There is a small amount (64 KB) of fast memory per SM, split between user-managed shared memory and L1 cache. In addition, there is an L2 cache per GPU device which further reduces the number of slow device memory accesses.

There are two ways to leverage the computational power of Nvidia GPUs. Compute Unified Device Architecture (CUDA) [16] is an Nvidia specific software programming model, while OpenCL is a portable open standard which can target different many-core architectures such as GPUs and conventional CPUs. These two programming models are very similar but use different terminologies. Although we implemented our design using CUDA, the design can be readily ported to OpenCL to target other multi-core architectures.

In the CUDA programming model, the programmer specifies the parallelism explicitly by defining a kernel function, which describes a sequence of operations applied to a data set. Multiple thread-blocks are spawned on the GPU during a kernel launch. Each thread-block consists of multiple threads, where each thread is arranged on a grid and has a unique 3-dimensional ID. Using this unique ID, each thread selects a data set and executes the kernel function on the selected data set.
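As an illustration of this model (not code from the decoder itself), the sketch below shows a hypothetical kernel in which each thread derives a unique index from its block and thread IDs and applies the same operation to its own element:

__global__ void scale_kernel(float *data, float alpha, int n)
{
    // Each thread forms a unique global index from its block and thread IDs.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= alpha;   // same instruction, different data element per thread
}

// Host side: spawn enough 128-thread thread-blocks to cover all n elements.
// scale_kernel<<<(n + 127) / 128, 128>>>(d_data, 2.0f, n);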

At runtime, each thread-block is assigned to an SM and executed independently. Thread-blocks typically are synchronized by writing to device memory and terminating the kernel. Unlike thread-blocks, threads within a thread-block, which reside on a single SM, can be synchronized through barrier synchronization and share data through shared memory. Threads within a thread-block execute in groups of 32 threads. When 32 threads share the same set of operations, they share the same warp instruction and are processed in parallel in an SIMD fashion. If threads do not share the same instruction, the threads are executed serially.

To achieve peak performance on a programmable GPU, the programmer needs to keep the available computation resources fully utilized. Underutilization occurs due to horizontal and vertical waste. Vertical waste occurs when an SM stalls and cannot find an instruction to issue, while horizontal waste occurs when the issue width is larger than the available parallelism.

Vertical waste occurs primarily due to pipeline stalls. Stalls occur for several reasons. As the floating point arithmetic pipeline is long, register to register dependencies can cause a multi-cycle stall. In addition, an SM can stall waiting for device memory reads or writes. In both cases, the GPU has hardware support for fine-grain multithreading to hide stalls. Multiple threads, or concurrent threads, can be mapped onto an SM and executed on an SM simultaneously. The GPU can minimize stalls by switching over to another independent warp instruction on a stall. In the case where a stall is due to memory access, the programmer can fetch frequently used data into shared memory to reduce memory access latency. However, as the number of concurrent threads is limited by the amount of shared memory and registers used per thread-block, the programmer needs to balance the amount of on-chip memory resources used. Shared memory increases computational throughput by keeping data on-chip. However, an excessive amount of shared memory used per thread-block reduces the number of concurrent threads and leads to vertical waste.

Although shared memory can improve the performance of a program significantly, there are several limitations. Shared memory on each SM is banked 16 ways. An access takes one cycle if 16 consecutive threads access the same shared memory address (broadcast) or none of the threads access the same bank (one-to-one). However, a random layout with some broadcast and some one-to-one accesses will be serialized and cause a stall. The programmer may need to modify the memory access pattern to improve efficiency.

Horizontal waste occurs when there is an insufficient workload to keep all of the cores busy. On a GPU device, this occurs if the number of thread-blocks is smaller than the number of SMs. The programmer needs to create more thread-blocks to handle the workload. Alternatively, the programmer can solve multiple problems at the same time to increase efficiency. Horizontal waste can also occur within an SM. This can occur if the number of threads in a thread-block is not a multiple of 32. In this case, the programmer needs to divide the workload of a thread-block across multiple threads if possible. An alternative is to pack multiple sub-problems into one thread-block, as close to the width of the SIMD instruction as possible. However, packing multiple problems into one thread-block may increase the amount of shared memory used, which leads to vertical waste. Therefore, the programmer may need to balance horizontal waste and vertical waste to maximize performance.

As a result, it is a challenging task to implement an algorithm that keeps the GPU cores from idling: we need to partition the workload across cores and use shared memory effectively to reduce device memory accesses, while ensuring a sufficient number of concurrently executing thread-blocks to hide stalls.

3 Turbo Decoding Algorithm

Turbo decoding is an iterative algorithm that can achieve error performance close to the channel capacity. A Turbo decoder consists of two component decoders and two interleavers, as shown in Fig. 1. The Turbo decoding algorithm consists of multiple passes through the two component decoders, where one iteration consists of one pass through both decoders.

Figure 1 Overview of Turbo decoding: two component decoders exchange extrinsic LLRs (L_a, L_e) through an interleaver (Π) and deinterleaver (Π⁻¹), with the channel LLRs L_c(y^s), L_c(y^p0), and L_c(y^p1) as inputs.

Although both decoders perform the same sequence of computations, the decoders generate different log-likelihood ratios (LLRs), as the two decoders have different inputs. The inputs of the first decoder are the deinterleaved extrinsic LLRs from the second decoder and the input LLRs from the channel. The inputs of the second decoder are the interleaved extrinsic LLRs from the first decoder and the input LLRs from the channel.

Each component decoder is a MAP (maximum a posteriori) decoder. The principle of the decoding algorithm is based on the BCJR or MAP algorithm [17]. Each component decoder generates an output LLR for each information bit. The MAP decoding algorithm can be summarized as follows. To decode a codeword with N information bits, each decoder performs a forward trellis traversal to compute N sets of forward state metrics, one α set per trellis stage. The forward traversal is followed by a backward trellis traversal which computes N sets of backward state metrics, one β set per trellis stage. Finally, the forward and backward metrics are merged to compute the output LLRs. We will now describe the metric computations in detail.

Figure 2 3GPP LTE Turbo code trellis with eight states.

As shown in Fig. 2, the trellis structure is defined by the encoder. The 3GPP LTE Turbo code trellis has eight states per stage, and each state in the trellis has two incoming paths, one path for u_b = 0 and one path for u_b = 1. Let s_k be a state at stage k; the transition probability is defined as:

γ_k(s_{k−1}, s_k) = (L_c(y_k^s) + L_a(y_k^s)) · u_k + L_c(y_k^p) · p_k,   (1)

where u_k, the information bit, and p_k, the parity bit, depend on the path taken (s_{k−1}, s_k). L_c(y_k^s) is the systematic channel LLR, L_a(y_k^s) is the a priori LLR, and L_c(y_k^p) is the parity bit channel LLR at stage k.

During the forward traversal, the sets of state metrics are computed recursively, as the next set of state metrics depends on the current set of state metrics. The forward state metric for a state s_k at stage k, α_k(s_k), is defined as:

α_k(s_k) = max*_{s_{k−1} ∈ K} (α_{k−1}(s_{k−1}) + γ(s_{k−1}, s_k)),   (2)

where K is the set of paths that connect a state in stage k − 1 to state s_k in stage k.

After the forward traversal, the decoder performs a backward traversal to compute the backward state metrics recursively. The backward state metric for state s_k at stage k, β_k(s_k), is defined as:

β_k(s_k) = max*_{s_{k+1} ∈ K} (β_{k+1}(s_{k+1}) + γ(s_{k+1}, s_k)).   (3)

After computing β_k, the state metrics for all states in stage k, we compute two LLRs per trellis state: one state LLR per state s_k, Λ(s_k|u_b = 0), for the incoming path connected to state s_k that corresponds to u_b = 0, and one state LLR per state s_k, Λ(s_k|u_b = 1), for the incoming path connected to state s_k that corresponds to u_b = 1. The state LLR Λ(s_k|u_b = 0) is defined as:

Λ(s_k|u_b = 0) = α_{k−1}(s_{k−1}) + γ(s_{k−1}, s_k) + β_k(s_k),   (4)

where the path from s_{k−1} to s_k with u_b = 0 is used in the computation. Similarly, the state LLR Λ(s_k|u_b = 1) is defined as:

Λ(s_k|u_b = 1) = α_{k−1}(s_{k−1}) + γ(s_{k−1}, s_k) + β_k(s_k),   (5)

where the path from s_{k−1} to s_k with u_b = 1 is used in the computation.

To compute the extrinsic LLR for u_k, we perform the following computation:

L_e(k) = max*_{s_k ∈ K} (Λ(s_k|u_b = 0)) − max*_{s_k ∈ K} (Λ(s_k|u_b = 1)) − L_a(y_k^s) − L_c(y_k^s),   (6)

where K is the set of all possible states and max*() is defined as max*(S) = ln(Σ_{s ∈ S} e^s).

The decoding algorithm described above requires the completion of N stages of α before the backward traversal. This is very sequential and requires a large amount of memory to store the N stages of α before the start of the β computation. There are several ways to distribute the workload and memory storage across multiple decoders to decrease the decoding latency. Typically, the incoming codeword is broken up into multiple sub-blocks, and each sub-block is processed independently and in parallel. As the starting state metrics of the forward and backward traversals are unknown for the sub-blocks, we assume a uniform distribution for the starting state metrics. Since each sub-block may not have an accurate starting metric, decoding each sub-block independently results in an error correction performance loss. There are a number of ways to recover the performance loss due to this edge effect. One is next iteration initialization, where we forward the forward and backward state metrics between iterations. As shown in Fig. 3, the last α computed for sub-block i can be forwarded to sub-block i + 1. Similarly, the last backward metric computed by sub-block i + 1 can be forwarded to sub-block i. By forwarding the metrics among sub-blocks between iterations, the sub-blocks have more accurate starting metrics for the next decoding iteration. Another common alternative is the sliding window algorithm with a training sequence [18], where the ith sub-block starts the forward and backward traversals w samples earlier. This allows different sub-blocks to start with more accurate forward and backward metrics.

Figure 3 Next iteration initialization.

4 Implementation of Turbo Decoder on GPU

We implemented a parallel Turbo decoder on GPU. Instead of spawning one thread-block per codeword to perform decoding, a codeword is split into P sub-blocks that are decoded in parallel using multiple thread-blocks. The algorithm described in Section 3 maps very efficiently onto an SM, since the algorithm is very data parallel. As the number of trellis states is eight for the 3GPP compliant Turbo code, the data parallelism of this algorithm is eight. However, the minimum number of threads within a warp instruction is 32. Therefore, to reduce horizontal waste, we allow each thread-block to process a total of 16 sub-blocks from 16 codewords simultaneously; the number of threads per thread-block is therefore 128, which enables four fully occupied warp instructions. As P sub-blocks may not be enough to keep all the SMs busy, we also decode N codewords simultaneously to minimize the amount of horizontal waste due to idling cores. We spawn a total of NP/16 thread-blocks to handle the decoding workload for N codewords. Figure 4 shows how threads are partitioned to handle the workload for N codewords.

In our implementation, the inputs of the decoder, the LLRs from the channel, are copied from the host memory to device memory. At runtime, each group of eight threads within a thread-block generates output LLRs given the input for a codeword sub-block. Each iteration consists of a pass through the two MAP decoders. Since each half iteration of the MAP decoding algorithm performs the same sequence of computations, both halves of an iteration can be handled by a single MAP decoder kernel. After a half decoding iteration, thread-blocks are synchronized by writing extrinsic LLRs to device memory and terminating the kernel. In the device memory, we allocate memory for both the extrinsic LLRs from the first half iteration and the extrinsic LLRs from the second half iteration. For example, the first half iteration reads a priori LLRs and writes extrinsic LLRs interleaved; the second half iteration reads a priori LLRs and writes extrinsic LLRs deinterleaved. Since interleaving and deinterleaving permute the input memory addresses, device memory access becomes random. In our implementation, we prefer sequential reads and random writes over random reads and sequential writes, as device memory writes are non-blocking. This increases efficiency, as the kernel does not need to wait for device memory writes to complete before proceeding. A single kernel can handle input and output reconfiguration easily with a couple of simple conditional reads and writes at the beginning and the end of the kernel.

Figure 4 To decode N codewords, we divide each codeword into P sub-blocks. Each thread-block has 128 threads and handles 16 codeword sub-blocks.
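As a sketch of this control flow (our naming, not the exact decoder code), each iteration launches the same MAP kernel twice, and a flag tells the kernel whether its extrinsic LLR writes should be interleaved or deinterleaved; thread-blocks synchronize implicitly because the kernel terminates between half iterations:

// Hypothetical host-side loop; d_llr_ch holds channel LLRs, d_ext0/d_ext1 hold
// the extrinsic LLRs produced by the first and second half iterations.
for (int it = 0; it < max_iterations; it++) {
    // First half iteration: a priori LLRs from d_ext1, extrinsic LLRs written interleaved.
    map_kernel<<<N * P / 16, 128>>>(d_llr_ch, d_ext1, d_ext0, /*interleave_writes=*/true);
    // Second half iteration: a priori LLRs from d_ext0, extrinsic LLRs written deinterleaved.
    map_kernel<<<N * P / 16, 128>>>(d_llr_ch, d_ext0, d_ext1, /*interleave_writes=*/false);
}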

In our kernel, we need to recover the performance loss due to edge effects, as the decoding workload is partitioned across multiple thread-blocks. Although a sliding window algorithm with a training sequence could be used to improve the BER performance of the decoder, it is not implemented; the next iteration initialization technique improves the error correction performance with much smaller overhead. In this method, the α and β values between neighboring thread-blocks are exchanged through device memory between iterations.

The CUDA architecture can be viewed as a specific realization of a multi-core SIMD processor. As a result, although the implementation is optimized specifically for Nvidia GPUs, the general strategy can be adapted for other many-core architectures with vector extensions. However, many other vector extensions, such as SSE and AltiVec, do not support transcendental functions, which leads to a greater throughput difference between Max-log-MAP and Full-log-MAP implementations.

The implementation details of the reconfigurable MAP kernel are described in the following subsections.

4.1 Shared Memory Allocation

If we partition a codeword with K information bits into P partitions, we need to compute K/P stages of α before we can compute β. If we attempt to cache these intermediate values in shared memory, we will need to store 8K/P floats in shared memory per partition. As we need to minimize vertical waste by decoding multiple codewords per thread-block, the amount of shared memory is quadrupled to pack 4 codewords into a thread-block to match the width of a warp instruction. Since we only have 48 KB of shared memory, which is divided among the concurrent thread-blocks on an SM, we will not be able to have many concurrent threads if P is small. For example, if K = 6,144 and P = 32, the amount of shared memory required by α is 24 KB. The number of concurrent threads is then only 64, leading to vertical waste, as we cannot hide the pipeline latency with concurrently running blocks. We can reduce the amount of shared memory used by increasing P. This, however, can reduce the error correction performance. Therefore, we need a better strategy for managing shared memory instead of relying on increasing P.

Instead of storing all α values in shared memory, we can spill α to device memory each time we compute a new α, storing only one stage of α during the forward traversal. For example, suppose α_{k−1} is in shared memory. After calculating α_k using α_{k−1}, we store α_k in device memory and replace α_{k−1} with α_k. During the LLR computation, when we need α to compute Λ_k(s_k|u_b = 0) and Λ_k(s_k|u_b = 1), we fetch α directly into registers. Similarly, we store one stage of β_k values during the backward traversal; therefore, we do not need to store β in device memory. In order to increase thread utilization during the extrinsic LLR computation, we save up to eight stages of Λ_k(s_k|u_b = 0) and eight stages of Λ_k(s_k|u_b = 1). We can reuse the shared memory used for the LLR computation, α, and β. Therefore, the total amount of shared memory per thread-block, packing 16 codewords per thread-block, is 2,048 floats or 8 KB. This allows us to have 768 threads running concurrently on an SM while providing fast memory access most of the time.
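One way to account for the 8 KB figure (an estimate with 4-byte floats; the exact buffer layout is not spelled out in the text):

  Λ0 buffer: 8 stages × 8 states × 16 codewords = 1,024 floats
  Λ1 buffer: 8 stages × 8 states × 16 codewords = 1,024 floats
  α and β (one stage each): 8 states × 16 codewords = 128 floats each, reusing the Λ space
  Total: about 2,048 floats × 4 bytes = 8 KB per thread-block
  48 KB / 8 KB = 6 concurrent thread-blocks × 128 threads = 768 concurrent threads per SM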

4.2 Forward Traversal

During the forward traversal, eight cooperating threads decode one codeword sub-block. The eight cooperating threads traverse the trellis in lock-step to compute α. There is one thread per trellis level, where the jth thread evaluates two incoming paths and updates α_k(s_j) for the current trellis stage using α_{k−1}, the forward metrics from the previous trellis stage k − 1. Equation 2 computes α_k(s_j). The computation, however, depends on the path taken (s_{k−1}, s_k). The two incoming paths are known a priori, since the connections are defined by the trellis structure shown in Fig. 2. Table 1 summarizes the operands needed for the α computation. The indices of α are stored as constants. Each thread loads the indices and the values p_k|u_b = 0 and p_k|u_b = 1 at the start of the kernel. The pseudo-code for one iteration of the α_k computation is shown in Algorithm 1. The memory access pattern is very regular for the forward traversal: threads access values of α_k in different memory banks, so there are no shared memory conflicts, and memory reads and writes are handled efficiently by shared memory.

Table 1 Operands for α_k computation.

                   u_b = 0           u_b = 1
Thread id (i)      s_{k−1}   p_k     s_{k−1}   p_k
0                  0         0       1         1
1                  3         1       2         0
2                  4         1       5         0
3                  7         0       6         1
4                  1         0       0         0
5                  2         1       3         1
6                  5         1       4         1
7                  6         0       7         0

Algorithm 1 Thread i computes α_k(i)
  a0 ← α_{k−1}(s_{k−1}|u_b = 0) + L_c(y_k^p) · (p_k|u_b = 0)
  a1 ← α_{k−1}(s_{k−1}|u_b = 1) + (L_c(y_k^s) + L_a(k)) + L_c(y_k^p) · (p_k|u_b = 1)
  α_k(i) = max*(a0, a1)
  write α_k(i) to device memory
  SYNC
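A minimal CUDA rendering of Algorithm 1 might look as follows (a sketch with our own names; the constant tables correspond to Table 1, and maxstar() implements Eq. 11):

__constant__ int   c_s0[8], c_s1[8];   // previous-state indices for u_b = 0 and u_b = 1
__constant__ float c_p0[8], c_p1[8];   // parity bits p_k for u_b = 0 and u_b = 1

__device__ float maxstar(float a, float b)   // Full-log-MAP; use fmaxf(a, b) alone for Max-log-MAP
{
    return fmaxf(a, b) + __logf(1.0f + __expf(-fabsf(b - a)));
}

// Forward traversal for one codeword sub-block; thread i (0..7) owns trellis state i.
__device__ void forward_traversal(const float *Lc_s, const float *La, const float *Lc_p,
                                  float *alpha_dev, float *alpha_sh, int stages, int i)
{
    for (int k = 0; k < stages; k++) {
        float a0 = alpha_sh[c_s0[i]] + Lc_p[k] * c_p0[i];                      // path with u_b = 0
        float a1 = alpha_sh[c_s1[i]] + (Lc_s[k] + La[k]) + Lc_p[k] * c_p1[i];  // path with u_b = 1
        float a_new = maxstar(a0, a1);
        alpha_dev[k * 8 + i] = a_new;   // spill this stage of alpha to device memory (Section 4.1)
        __syncthreads();                // all threads have read the previous stage
        alpha_sh[i] = a_new;            // keep only the newest stage in shared memory
        __syncthreads();
    }
}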

4.3 Backward Traversal and LLR Computation

After the forward traversal, each thread-block traverses the trellis backward to compute β. We assign one thread to each trellis level to compute β, followed by computing Λ0 and Λ1, as shown in Algorithm 2. The indices of β_k and the values of p_k are summarized in Table 2. Similar to the forward traversal, there are no shared memory bank conflicts, since each thread accesses an element of α or β in a different bank.

Algorithm 2 Thread i computes β_k(i), Λ0(i), and Λ1(i)
  Fetch α_k(i) from device memory
  b0 ← β_{k+1}(s_{k+1}|u_b = 0) + L_c(y_k^p) · (p_k|u_b = 0)
  b1 ← β_{k+1}(s_{k+1}|u_b = 1) + (L_c(y_k^s) + L_a(k)) + L_c(y_k^p) · (p_k|u_b = 1)
  β_k(i) = max*(b0, b1)
  SYNC
  Λ0(i) = α_k(i) + L_c(y_k^p) · p_k + β_{k+1}(i)
  Λ1(i) = α_k(i) + (L_c(y_k^s) + L_a(k)) + L_c(y_k^p) · p_k + β_{k+1}(i)

Table 2 Operands for β_k computation.

                   u_b = 0           u_b = 1
Thread id (i)      s_{k+1}   p_k     s_{k+1}   p_k
0                  0         0       4         0
1                  4         1       0         0
2                  5         1       1         1
3                  1         0       5         1
4                  2         0       6         1
5                  6         1       2         1
6                  7         1       3         0
7                  3         0       7         0

After computing Λ0 and Λ1 for stage k, we can compute the extrinsic LLR for stage k. However, there are eight threads available to compute this single LLR, which introduces parallelism overhead. Instead of computing one extrinsic LLR for stage k as soon as the decoder computes β_k, we allow the threads to traverse through the trellis and save eight stages of Λ0 and Λ1 before performing extrinsic LLR computations. By saving eight stages of Λ0 and Λ1, we allow all eight threads to compute LLRs in parallel efficiently. Each thread handles one stage of Λ0 and Λ1 to compute an LLR. Although this increases thread utilization, threads need to avoid accessing the same bank when computing an extrinsic LLR. For example, the eight elements of Λ0 for each stage are stored in eight consecutive addresses. Since there are 16 memory banks, elements of even stages of Λ0 or Λ1 with the same index share the same memory bank; likewise, this is true for odd stages. Hence, sequential accesses to Λ0 or Λ1 to compute an extrinsic LLR result in four-way memory bank conflicts. To alleviate this problem, we permute the access pattern based on thread ID, as shown in Algorithm 3.

Algorithm 3 Thread i computes L_e(i)
  λ0 = Λ0(i)
  λ1 = Λ1(i)
  for j = 1 to 7 do
    index = (i + j) & 7
    λ0 = max*(λ0, Λ0(index))
    λ1 = max*(λ1, Λ1(index))
  end for
  L_e = λ1 − λ0
  Compute write address
  Write L_e to device memory
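In CUDA terms, the rotation of Algorithm 3 can be sketched as below (our names; lambda0_sh and lambda1_sh hold the eight saved stages of Λ0 and Λ1, eight states per stage, and maxstar() is the helper from the earlier sketch):

// Thread i computes the extrinsic LLR for its saved trellis stage. Starting at
// state i and rotating through the other states keeps the eight threads in
// different shared memory banks.
__device__ float extrinsic_llr(const float *lambda0_sh, const float *lambda1_sh, int i)
{
    float l0 = lambda0_sh[i * 8 + i];
    float l1 = lambda1_sh[i * 8 + i];
    for (int j = 1; j < 8; j++) {
        int s = (i + j) & 7;                    // rotated state index, as in Algorithm 3
        l0 = maxstar(l0, lambda0_sh[i * 8 + s]);
        l1 = maxstar(l1, lambda1_sh[i * 8 + s]);
    }
    return l1 - l0;                             // written to device memory at a permuted address
}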

4.4 Early Termination Scheme

Depending on the SNR, a Turbo decoder requires a variable number of iterations to achieve satisfactory BER performance. In the mid and high SNR regimes, the Turbo decoding algorithm usually converges to the correct codeword within a small number of decoding iterations. Therefore, fixing the number of decoding iterations is inefficient. Early termination schemes are widely used to accelerate the decoding process while maintaining a given BER performance [19–21]. As the average number of decoding iterations is reduced by using an early termination scheme, early termination can also reduce power consumption. As such, early termination schemes are widely used in low power, high performance Turbo decoder designs.

There are several major approaches to implementing early termination: hard-decision rules, soft-decision rules, CRC-based rules and other hybrid rules. Hard-decision rules and soft-decision rules are the most popular early termination schemes due to their low complexity, and among these low complexity schemes, soft-decision rules provide better error correcting performance. Therefore, we implement two soft-decision early termination schemes: the minimum LLR threshold scheme and the average LLR threshold scheme.

The stop condition of the minimum LLR scheme can be expressed by:

min_{1≤i≤N} |LLR_i| ≥ T,   (7)

in which we compare the minimum LLR value with a pre-set threshold T at the end of each iteration. If the minimum LLR value is greater than the threshold, then the iterative decoding process is terminated.

The stop condition of the average LLR scheme can be represented by:

(1/N) Σ_{1≤i≤N} |LLR_i| ≥ T,   (8)

where N is the block length of the codeword and T is the pre-set threshold.

The simulation results show that for multi-codeword parallel Turbo decoding, the variation among the minimum LLR values for different codewords is very large. Since each thread-block decodes 16 sub-blocks from 16 codewords simultaneously, we can only terminate the thread-block if all 16 codewords meet the early termination criterion. Therefore, the minimum LLR values are not an accurate metric for early termination in our implementation. The average LLR value is more stable, so the average LLR scheme is implemented for this parallel Turbo decoder. As mentioned in the previous subsection, during the backward traversal and LLR computation process, eight stages of Λ0 and Λ1 are saved in memory. After Λ0 and Λ1 are known, the eight threads are able to compute LLRs using these saved Λ0 and Λ1 values in parallel. Therefore, to compute the average LLR value for one codeword, each thread can track the sum of the LLRs while going through the whole trellis. At the end of the backward traversal, we combine all eight sums of LLRs and compute the average LLR value of the codeword. Finally, this average LLR value is compared with a pre-set threshold to determine whether the early termination condition is met. The detailed algorithm is described in Algorithm 4.

Another challenge is that it is difficult to wait for hundreds of codewords to converge simultaneously and terminate the decoding process for all codewords at the same time. Therefore, a tag-based scheme is employed. Once a codeword meets the early termination condition, the corresponding tag is marked and this codeword is not processed further in later iterations. After all the tags are marked, we stop the iterative decoding process for all the codewords. By using a tag-based early termination scheme, the decoding throughput can be significantly increased in the mid and high SNR regimes.


Algorithm 4 Early termination scheme for thread i
  Compute the codeword ID C_id
  if tag[C_id] == 1 then
    Terminate thread i
  end if
  Forward traversal
  for all output LLRs L_e during the backward traversal do
    Sum(i) += L_e
  end for
  if threadId == 0 then
    Average = (1/8) · Σ_{j=1}^{8} Sum(j)
  end if


4.5 Interleaver

An interleaver is used between the two half decoding iterations. Given an input address, the interleaver provides an interleaved address; this interleaves and deinterleaves the memory writes. In our implementation, a quadratic permutation polynomial (QPP) interleaver [22], which is proposed in the 3GPP LTE standard, is used. The QPP interleaver guarantees bank free memory accesses, where each sub-block accesses a different memory bank. Although this is useful in an ASIC design, the QPP interleaver is very memory I/O intensive for a GPU, as the memory write access pattern is still random. As inputs are stored in device memory, random accesses result in non-coalesced memory writes. With a sufficient number of threads running concurrently on an SM, we can amortize the performance loss due to device memory accesses through fast thread switching. The QPP interleaver is defined as:

Π(x) = f_1 · x + f_2 · x²  (mod N).   (9)

The interleaver address Π(x) can be computed on the fly using Eq. 9. However, direct computation can cause overflow; for example, for N = 6,144 the term f_2 · x² evaluated at x = 6,143 can exceed the range of a 32-bit integer. Therefore, the following equation is used to compute Π(x) instead:

Π(x) = ((f_1 + f_2 · x (mod N)) · x) (mod N)   (10)

Another way of computing Π(x) is recursively [6], which requires Π(x) to be computed before we can compute Π(x + 1). This is not efficient for our design, as we need to compute several interleaved addresses in parallel. For example, during the second half of the iteration, to store the extrinsic LLR values, eight threads need to compute eight interleaved addresses in parallel. Equation 10 allows efficient address computation in parallel.
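A direct C/CUDA rendering of Eq. 10 could look as follows (a sketch; the function name is ours, and f1, f2, N come from the 3GPP LTE interleaver parameter table for the chosen block size). The factored form keeps the intermediate values far smaller than the direct f_1·x + f_2·x² of Eq. 9; a 64-bit product is used below purely as an extra safety margin:

__host__ __device__ int qpp_interleave(int x, int f1, int f2, int N)
{
    int t = (int)(((long long)f2 * x + f1) % N);   // (f1 + f2*x) mod N
    return (int)(((long long)t * x) % N);          // multiply by x and reduce again
}

Because each address depends only on x, eight threads can each evaluate qpp_interleave() for their own x and obtain eight interleaved write addresses in parallel, which the recursive formulation of [6] would not allow.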

Although our decoder is configured for the 3GPP LTE standard, one can replace the current interleaver function with another function to support other standards. Furthermore, we can define multiple interleavers and switch between them on the fly, since the interleaver is defined in software in our GPU implementation.

4.6 max∗ Function

We support the Full-log-MAP algorithm as well as the Max-log-MAP algorithm [23]. Full-log-MAP is defined as:

max*(a, b) = max(a, b) + ln(1 + e^{−|b−a|}).   (11)

The complexity of the computation can be reduced by assuming that the second term is small. Max-log-MAP is defined as:

max*(a, b) = max(a, b).   (12)

As was the case with the interleaver, we can compute max*(a, b) directly. We support Full-log-MAP, as both the natural logarithm and the natural exponential are supported on CUDA. However, logarithms and natural exponentials take longer to execute on the GPU compared to common floating-point operations, e.g. multiply and add. Therefore we expect a throughput loss compared to Max-log-MAP.
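Both variants can be written as small CUDA device functions (a sketch with our names; the fast intrinsics __expf and __logf are one possible choice and trade a little accuracy for speed):

// Max-log-MAP, Eq. 12.
__device__ float maxstar_maxlog(float a, float b)
{
    return fmaxf(a, b);
}

// Full-log-MAP, Eq. 11, with the correction term computed from fast intrinsics.
__device__ float maxstar_fulllog(float a, float b)
{
    return fmaxf(a, b) + __logf(1.0f + __expf(-fabsf(b - a)));
}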

5 BER Performance and Throughput Results

We evaluated the accuracy of our decoder by comparing it against a reference standard C language implementation. To evaluate the BER performance and throughput of our Turbo decoder, we tested it on a Windows 7 platform with 8 GB of DDR2 memory running at 800 MHz and an Intel Core 2 Quad Q6600 processor running at 2.4 GHz. The GPU used in our experiments is the Nvidia GeForce GTX 470 graphics card, which has 448 stream processors running at 1.215 GHz with 1,280 MB of GDDR5 memory running at 1,674 MHz.

5.1 Decoder BER Performance

Our decoder can divide a codeword into P sub-blocks. Since our decoder processes eight stages in parallel to compute LLRs, we support a P value if the length of the corresponding sub-blocks is divisible by eight. We expect that the number of sub-blocks per codeword affects the overall decoder BER performance, as larger P introduces more edge effects.

For our simulation, the host computer first generates random 3GPP LTE Turbo codewords. After BPSK modulation, the input symbols are passed through a channel with AWGN noise, and the host generates LLR values based on the received symbols, which are fed into the Turbo decoder kernel running on the GPU. For these experiments, we tested our decoder with the following P values, P = 1, 32, 64, 96, 128, for a 3GPP LTE Turbo code with N = 6,144. In addition, we tested both Full-log-MAP and Max-log-MAP, with the decoder performing six decoding iterations.
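For reference, the channel LLR generation on the host can be sketched as follows (our code; it assumes unit-energy BPSK symbols and noise variance sigma2, so the channel LLR is 2y/sigma2, and the sign convention is ours):

// Map a bit to a BPSK symbol, add an AWGN sample, and form the channel LLR.
float bpsk_channel_llr(int bit, float noise_sample, float sigma2)
{
    float x = bit ? 1.0f : -1.0f;    // BPSK mapping
    float y = x + noise_sample;      // AWGN channel, noise variance sigma2
    return 2.0f * y / sigma2;        // channel LLR fed to the decoder kernel
}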

Figure 5 shows the BER performance of our decoder using Full-log-MAP, while Fig. 6 shows the BER performance of our decoder using Max-log-MAP.

In both cases, the BER of the decoder becomes worse as we increase P. The BER performance of the decoder is significantly better when Full-log-MAP is used. Furthermore, we see that larger P can offer reasonable performance. For example, when P = 96, where each sub-block is only 64 stages long, the decoder provides BER performance that is within 0.1 dB of the performance of the optimal case (P = 1). For a parallelism of 32, the decoder provides BER performance that is close to the optimal case.

5.2 Decoder Throughput

5.2.1 Maximum Throughput

The value P affects the throughput performance, as it controls the number of thread-blocks spawned at runtime.

Figure 5 BER performance (BPSK, Full-log-MAP): BER versus Eb/N0 for P = 1, 32, 64, 96, 128.

Figure 6 BER performance (BPSK, Max-log-MAP): BER versus Eb/N0 for P = 1, 32, 64, 96, 128.

To find the maximum throughput this decoder can achieve, we use an extremely large workload, a batch of 2,048 codewords, to ensure there is a sufficient number of thread-blocks for all possible P values. As the decoding time is linearly dependent on the number of trellis stages traversed, varying P and K does not significantly affect the decoder throughput, provided there is a sufficient workload to keep the cores busy. The decoder's maximum throughput only depends on the number of iterations performed, the max* function, and the interleaver method used. We vary these parameters and measure the throughput of the decoder using event management in the CUDA runtime API. The throughput of the decoder is summarized in Table 3. We see that the throughput of the decoder is inversely proportional to the number of iterations performed: the throughput after m iterations can be approximated as T0/m, where T0 is the throughput of the decoder after one iteration. Although the throughput of Full-log-MAP is lower than that of Max-log-MAP, as expected, the difference is small. However, Full-log-MAP provides a significant BER performance improvement.

Table 3 Maximum decoder throughput.

Iterations   Max-log-MAP (Mbps)   Full-log-MAP (Mbps)
1            95.36                89.79
2            61.08                57.14
3            44.99                42.07
4            35.57                33.14
5            29.45                27.54
6            25.13                23.31
7            21.92                20.26
8            19.41                18.00
9            17.44                16.19
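The throughput numbers above were obtained with event management in the CUDA runtime API; a minimal version of such a measurement (our code, with a hypothetical decode_batch() wrapper around the kernel launches) is:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
decode_batch(d_in, d_out, num_codewords);       // all half-iteration kernel launches
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);         // elapsed time in milliseconds
// throughput in Mbps = (num_codewords * K information bits) / (ms * 1000)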


Figure 7 Number of codewords (N) versus number of sub-blocks (P) for the Max-log-MAP and Full-log-MAP decoders.

5.3 Number of Sub-blocks vs. Number of Codewords

In the previous section, we found the maximum decoder throughput by feeding a very large workload into the decoder. For workloads with fewer codewords, P affects the throughput performance, as P controls the number of thread-blocks spawned at runtime. A workload of N codewords will spawn NP/16 thread-blocks, since each thread-block processes 16 sub-blocks from 16 codewords at the same time. As the GPU runs many concurrent threads to keep the SMs busy and hides stalls through thread switching, we expect that larger P configurations, which spawn more thread-blocks per codeword, will require a smaller N to approach maximum throughput.

To show how P affects the number of codewords required to achieve high throughput, we set a target throughput and vary N in steps of 32 for various values of P until the decoder's throughput exceeds the target throughput. In these experiments, we set K = 6,144 and the number of decoding iterations to 5. For the Max-log-MAP decoder, we set a target throughput of 27 Mbps; similarly, we set a target throughput of 24 Mbps for Full-log-MAP.

As shown in Fig. 7, the trends are similar for both cases. As expected, since a larger P spawns more thread-blocks per codeword, a larger P offers better throughput performance. There is a trade-off between decoding latency and error correction performance: although larger P offers lower latency, larger P provides poorer error correction performance. Simulations show that the case of P = 32 seems to provide balanced performance. This particular configuration provides good error correction performance while requiring a reasonably sized workload to achieve high throughput.

5.4 Throughput with Early Termination

To accelerate the decoding process, we implement early termination using the average LLR rule according to Algorithm 4. The computation of the average LLR is performed in the CUDA kernel. As the cost of the tag checking is small, tag checking is done in the host code. A simulation-based analysis is performed to determine a threshold value. In these simulations, the average LLR values are computed when the decoding process converges to the correct codeword. Based on the simulation results, a threshold of T = 40 is selected to guarantee that the BER is below 10^−5. To get better BER performance, a higher threshold T can be used.
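The host-side tag check mentioned above can be as simple as the following sketch (our code; d_tags is a hypothetical device array with one entry per codeword):

// Hypothetical early-termination control loop on the host.
int iter = 0;
bool all_done = false;
while (iter < max_iterations && !all_done) {
    run_one_iteration();                                  // two MAP kernel launches
    cudaMemcpy(h_tags, d_tags, num_codewords * sizeof(int),
               cudaMemcpyDeviceToHost);                   // fetch per-codeword tags
    all_done = true;
    for (int c = 0; c < num_codewords; c++)
        if (h_tags[c] == 0) { all_done = false; break; }  // not yet converged
    iter++;
}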

Figure 8 shows the throughput results when the early termination scheme is employed. The maximum number of iterations is set to 16. As the SNR increases, the average number of iterations needed to reach the given BER level is reduced, so the decoding throughput increases. The simulation results also show that the throughput for Eb/N0 = 0.5 dB is higher than that for Eb/N0 = 0 dB, although their average numbers of iterations are the same (both are 16). This result matches our expectation for the tag-based early termination algorithm. As mentioned in Section 4.4, the tag-based early termination algorithm stops the decoding process for already converged codewords, so even with the same average number of iterations the amount of computation is significantly reduced under these circumstances.

Figure 8 Throughput and average number of iterations versus Eb/N0 when the early termination scheme is used.


5.5 Architecture Comparison

Table 4 compares our decoder with other programmable Turbo decoders. Compared to the general purpose processor and multi-core DSP based solutions from [5, 24–27], our decoder with P = 32 compares favorably in terms of throughput and BER performance. For example, compared to a custom designed SIMD processor from [5], our solution shows both a flexibility advantage, by supporting both the Full-log-MAP and Max-log-MAP algorithms, and a throughput advantage, by supporting 15 times the data rate. This is expected, as our device has significantly more computational resources than general purpose processors and multi-core DSPs. In addition, we can support both the Full-log-MAP algorithm and the sub-optimal Max-log-MAP algorithm, while most other solutions only support the sub-optimal Max-log-MAP algorithm.

There are two recent papers on Turbo decoders on GPU [14, 15]. Both of these implementations try to increase computational throughput by reducing device memory accesses, saving α values in shared memory. However, the amount of shared memory per SM is limited. As a result, we need to divide a codeword into many sub-blocks to reduce the amount of shared memory required by each thread-block. Dividing a long codeword into many small sub-blocks improves throughput but reduces the error correction performance. An alternative is to divide a long codeword into a few sub-blocks. This requires a large amount of shared memory per thread-block. As a result, we cannot pack multiple sub-blocks in a thread-block and cannot have many concurrent threads to hide pipeline stalls, leading to significant horizontal and vertical waste which reduce decoder throughput. In [14], we kept the design to eight threads per thread-block, which supports sub-block lengths up to 192 stages. As the underlying instructions are 32-wide SIMD instructions, the cores are used at most 1/4 of the time with this design.

In this paper, we took a more balanced approach to shared memory usage. Since α values are stored in device memory and fetched into shared memory when needed, shared memory is not a limitation and is not dependent on P. As a result, we can spawn more concurrent threads to hide stalls and pack multiple sub-blocks in a thread-block to meet the SIMD instruction width. In the present design, we pack multiple sub-blocks from 16 codewords onto the same thread-block. We have 128 threads per thread-block, which can fully utilize the width of the SIMD instructions, minimizing vertical waste. As a result, our present solution is significantly faster while requiring fewer sub-blocks per codeword to achieve high performance.

To understand the impact of the architecture change and code redesign between [14] and this paper, we benchmarked our original Max-log-MAP decoder from [14] on an Nvidia GTX470 for 5 decoding iterations. For P = 96, we achieved a throughput of 11.05 Mbps. Compared to the throughput performance of our original design on the Tesla C1060, this is approximately two times faster. The improvement is expected, as the Nvidia GTX470 has 1.87 times more cores and introduces L1 and L2 caches. The throughput performance of the proposed design on the GTX470 is approximately 2.67 times faster than the original design on the GTX470, which reflects the improvement we achieved with the redesign. Although our new design packs multiple sub-blocks to meet the SIMD instruction width, we do not achieve four times the throughput. This is due to two reasons. First, although we fully utilize the SIMD instruction width, the number of instructions needed is not four times smaller than in the original design. Compared to our original implementation, the number of instructions for our new design increases, as extra load and store instructions are needed to move data between shared memory and device memory. Using the profiler, we noticed that the number of issued instructions of the new design is only 46.5% of that of the original design. Second, data are fetched from device memory in the proposed implementation, and the resulting cache misses increase the execution time.

Table 4 Our GPU based decoder vs. other programmable Turbo decoders.

Work        Architecture       MAP algorithm              Throughput           Iter.
[24]        Intel Pentium 3    Log-MAP and Max-log-MAP    366 Kbps / 51 Kbps   1
[25]        Motorola 56603     Max-log-MAP                48.6 Kbps            5
[25]        STM VLIW DSP       Log-MAP                    200 Kbps             5
[26]        TigerSHARC DSP     Max-log-MAP                2.399 Mbps           4
[27]        TMS320C6201 DSP    Max-log-MAP                500 Kbps             4
[5]         32-wide SIMD       Max-log-MAP                2.08 Mbps            5
[15]        Nvidia C1060       Max-log-MAP                2.1 Mbps             5
[14]        Nvidia C1060       Log-MAP and Max-log-MAP    6.77 / 5.2 Mbps      5
This work   Nvidia GTX470      Log-MAP and Max-log-MAP    29.45 / 27.54 Mbps   5


6 Conclusion

In this paper, we presented a 3GPP LTE compliant Turbo decoder implemented on GPU. We divide the workload across the cores on the GPU by dividing the codeword into many sub-blocks to be decoded in parallel and by decoding multiple codewords at the same time. In addition, we improve efficiency by allowing a thread-block to decode multiple codewords at the same time. We use shared memory to speed up device memory access; however, we do not store all intermediate data on-chip, in order to increase the number of concurrently running threads. The implementation also ensures computation is completely parallel for each sub-block. As different sub-block sizes can lead to BER performance degradation, we presented how both BER performance and throughput are affected by the sub-block size. We show that our decoder provides high throughput even when the Full-log-MAP algorithm is used. This work will enable the implementation of a high throughput decoder completely in software on a GPU.

Acknowledgements This work was supported in part by Renesas Mobile, Texas Instruments, Xilinx, and by the US National Science Foundation under grants CNS-0551692, CNS-0619767, EECS-0925942 and CNS-0923479.

References

1. Berrou, C., Glavieux, A., & Thitimajshima, P. (1993). Near Shannon limit error-correcting coding and decoding: Turbo-codes. In IEEE international conference on communication.

2. Garrett, D., Xu, B., & Nicol, C. (2001). Energy efficient turbo decoding for 3G mobile. In International symposium on low power electronics and design (pp. 328–333). ACM.

3. Bickerstaff, M., Davis, L., Thomas, C., Garrett, D., & Nicol, C. (2003). A 24Mb/s Radix-4 LogMAP turbo decoder for 3GPP-HSDPA mobile wireless. In IEEE Int. Solid-State Circuit Conf. (ISSCC).

4. Shin, M., & Park, I. (2007). SIMD processor-based turbo decoder supporting multiple third-generation wireless standards. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 15, 801–810.

5. Lin, Y., Mahlke, S., Mudge, T., Chakrabarti, C., Reid, A., & Flautner, K. (2006). Design and implementation of turbo decoders for software defined radio. In IEEE workshop on signal processing design and implementation (SIPS).

6. Sun, Y., Zhu, Y., Goel, M., & Cavallaro, J. R. (2008). Configurable and scalable high throughput turbo decoder architecture for multiple 4G wireless standards. In IEEE international conference on Application-Specific Systems, Architectures and Processors (ASAP) (pp. 209–214).

7. Salmela, P., Sorokin, H., & Takala, J. (2008). A programmable Max-Log-MAP turbo decoder implementation. Hindawi VLSI Design (pp. 636–640).

8. Wong, C.-C., Lee, Y.-Y., & Chang, H.-C. (2009). A 188-size 2.1 mm² reconfigurable turbo decoder chip with parallel architecture for 3GPP LTE system. In Symposium on VLSI circuits (pp. 288–289).

9. Amiri, K., Sun, Y., Murphy, P., Hunter, C., Cavallaro, J. R., & Sabharwal, A. (2007). WARP, a unified wireless network testbed for education and research. In MSE '07: Proceedings of the 2007 IEEE international conference on microelectronic systems education.

10. Kim, J., Hyeon, S., & Choi, S. (2010). Implementation of an SDR system using graphics processing unit. IEEE Communications Magazine, 48(3), 156–162.

11. Wu, M., Sun, Y., & Cavallaro, J. R. (2009). Reconfigurable real-time MIMO detector on GPU. In IEEE 43rd Asilomar conference on signals, systems and computers (ASILOMAR'09).

12. Nylanden, T., Janhunen, J., Silvén, O., & Juntti, M. J. (2010). A GPU implementation for two MIMO-OFDM detectors. In International conference on embedded computer systems (SAMOS) (pp. 293–300).

13. Falcão, G., Silva, V., & Sousa, L. (2009). How GPUs can outperform ASICs for fast LDPC decoding. In ICS '09: Proceedings of the 23rd international conference on supercomputing (pp. 390–399).

14. Wu, M., Sun, Y., & Cavallaro, J. (2010). Implementation of a 3GPP LTE turbo decoder accelerator on GPU. In Signal Processing Systems (SIPS) (pp. 192–197).

15. Lee, D., Wolf, M., & Kim, H. (2010). Design space exploration of the turbo decoding algorithm on GPUs. In International conference on compilers, architectures and synthesis for embedded systems (pp. 214–226).

16. NVIDIA Corporation (2008). CUDA compute unified device architecture programming guide. Available: http://www.nvidia.com/object/cuda_develop.html

17. Bahl, L., Cocke, J., Jelinek, F., & Raviv, J. (1974). Optimal decoding of linear codes for minimizing symbol error rate. IEEE Transactions on Information Theory, IT-20, 284–287.

18. Naessens, F., Bougard, B., Bressinck, S., Hollevoet, L., Raghavan, P., der Perre, L. V., & Catthoor, F. (2008). A unified instruction set programmable architecture for multi-standard advanced forward error correction. In IEEE workshop on Signal Processing Systems (SIPS).

19. Hagenauer, J., Offer, E., & Papke, L. (1996). Iterative decoding of binary block and convolutional codes. IEEE Transactions on Information Theory, 42, 429–445.

20. Shao, S. L. R., & Fossorier, M. (1996). Two simple stopping criteria for turbo decoding. IEEE Transactions on Information Theory, 42, 429–445.

21. Matache, A., Dolinar, S., & Pollara, F. (2000). Stopping rules for turbo decoders. In JPL TMO Progress Report (pp. 42–142).

22. Sun, J., & Takeshita, O. (2005). Interleavers for turbo codes using permutation polynomials over integer rings. IEEE Transactions on Information Theory, 51, 101–119.

23. Robertson, P., Villebrun, E., & Hoeher, P. (1995). A comparison of optimal and sub-optimal MAP decoding algorithms operating in the log domain. In IEEE Int. Conf. Commun. (pp. 1009–1013).

24. Valenti, M., & Sun, J. (2001). The UMTS turbo code and an efficient decoder implementation suitable for software-defined radios. International Journal of Wireless Information Networks, 8(4), 203–215.

25. Michel, H., Worm, A., Munch, M., & Wehn, N. (2002). Hardware software trade-offs for advanced 3G channel coding. In Proceedings of design, automation and test in Europe.

26. Loo, K., Alukaidey, T., & Jimaa, S. (2003). High performance parallelised 3GPP turbo decoder. In IEEE personal mobile communications conference (pp. 337–342).

27. Song, Y., Liu, G., & Yang, H. (2005). The implementation of turbo decoder on DSP in W-CDMA system. In International conference on wireless communications, networking and mobile computing (pp. 1281–1283).

Michael Wu received his B.S. degree from Franklin W. Olin College in May of 2007 and his M.S. degree from Rice University in May of 2010, both in Electrical and Computer Engineering. He is currently a Ph.D. candidate in the ECE department at Rice University. His research interests are wireless algorithms, software defined radio on GPGPU and other parallel architectures, and high performance wireless receiver designs.

Yang Sun received the B.S. degree in Testing Technology & Instrumentation in 2000, and the M.S. degree in Instrument Science & Technology in 2003, both from Zhejiang University, Hangzhou, China. From 2003 to 2004, he worked at S3 Graphics Co. Ltd. as an ASIC design engineer, developing 3D graphics processors (GPUs) for computers. From 2004 to 2005, he worked at Conexant Systems Inc. as an ASIC design engineer, developing video decoders for digital satellite-television set-top boxes (STBs). He is currently a Ph.D. student in the Department of Electrical and Computer Engineering at Rice University, Houston, Texas. His research interests include parallel algorithms and VLSI architectures for wireless communication systems, especially forward-error correction (FEC) systems. He received the 2008 IEEE SoC Conference Best Paper Award, the 2008 IEEE Workshop on Signal Processing Systems Best Paper Award (Bob Owens Memory Paper Award), and the 2009 ACM GLSVLSI Best Student Paper Award.

Guohui Wang received his B.S. in Electrical Engineering from Peking University, Beijing, China, in 2005, and M.S. in Computer Science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2008. Currently, he is a Ph.D. student in the Department of Electrical and Computer Engineering, Rice University, Houston, Texas. His research interests include VLSI signal processing for wireless communication systems and parallel signal processing on GPGPU.

Joseph R. Cavallaro received the B.S. degree from the University of Pennsylvania, Philadelphia, PA, in 1981, the M.S. degree from Princeton University, Princeton, NJ, in 1982, and the Ph.D. degree from Cornell University, Ithaca, NY, in 1988, all in electrical engineering. From 1981 to 1983, he was with AT&T Bell Laboratories, Holmdel, NJ. In 1988, he joined the faculty of Rice University, Houston, TX, where he is currently a Professor of electrical and computer engineering. His research interests include computer arithmetic, VLSI design and microlithography, and DSP and VLSI architectures for applications in wireless communications. During the 1996–1997 academic year, he served at the National Science Foundation as Director of the Prototyping Tools and Methodology Program. He was a Nokia Foundation Fellow and a Visiting Professor at the University of Oulu, Finland in 2005 and continues his affiliation there as an Adjunct Professor. He is currently the Director of the Center for Multimedia Communication at Rice University. He is a Senior Member of the IEEE. He was Co-chair of the 2004 Signal Processing for Communications Symposium at the IEEE Global Communications Conference and General/Program Co-chair of the 2003, 2004, and 2011 IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP).