
EURASIP Journal on Applied Signal Processing

Implementation of DSP and Communication Systems

Guest Editors: Yuke Wang and Yu Hen Hu


Copyright © 2002 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2002 of “EURASIP Journal on Applied Signal Processing.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Editor-in-Chief
K. J. Ray Liu, University of Maryland, College Park, USA

Associate Editors
Kiyoharu Aizawa, Japan; Gonzalo Arce, USA; Jaakko Astola, Finland; Mauro Barni, Italy; Sankar Basu, USA; Shih-Fu Chang, USA; Jie Chen, USA; Tsuhan Chen, USA; M. Reha Civanlar, USA; Tony Constantinides, UK; Luciano Costa, Brazil; Irek Defee, Finland; Ed Deprettere, The Netherlands; Zhi Ding, USA; Jean-Luc Dugelay, France; Pierre Duhamel, France; Tariq Durrani, UK; Sadaoki Furui, Japan; Ulrich Heute, Germany; Yu Hen Hu, USA; Jiri Jan, Czech; Shigeru Katagiri, Japan; Mos Kaveh, USA; Bastiaan Kleijn, Sweden; Ut Va Koc, USA; Aggelos Katsaggelos, USA; C. C. Jay Kuo, USA; S. Y. Kung, USA; Chin-Hui Lee, USA; Kyoung Mu Lee, Korea; Y. Geoffrey Li, USA; Heinrich Meyr, Germany; Ferran Marques, Spain; Jerry M. Mendel, USA; Marc Moonen, Belgium; José M. F. Moura, USA; Ryohei Nakatsu, Japan; King N. Ngan, Singapore; Takao Nishitani, Japan; Naohisa Ohta, Japan; Antonio Ortega, USA; Mukund Padmanabhan, USA; Ioannis Pitas, Greece; Raja Rajasekaran, USA; Phillip Regalia, France; Hideaki Sakai, Japan; William Sandham, UK; Wan-Chi Siu, Hong Kong; Piet Sommen, The Netherlands; John Sorensen, Denmark; Michael G. Strintzis, Greece; Ming-Ting Sun, USA; Tomohiko Taniguchi, Japan; Sergios Theodoridis, Greece; Yuke Wang, USA; Andy Wu, Taiwan; Xiang-Gen Xia, USA; Zixiang Xiong, USA; Kung Yao, USA


Contents

Editorial, Yuke Wang and Yu Hen Hu
Volume 2002 (2002), Issue 9, Pages 877-878

Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform, Fang Fang, Tsuhan Chen, and Rob A. Rutenbar
Volume 2002 (2002), Issue 9, Pages 879-892

High-Level Synthesis of DSP Applications Using Adaptive Negative Cycle Detection, Nitin Chandrachoodan, Shuvra S. Bhattacharyya, and K. J. Ray Liu
Volume 2002 (2002), Issue 9, Pages 893-907

Design and DSP Implementation of Fixed-Point Systems, Martin Coors, Holger Keding, Olaf Lüthje, and Heinrich Meyr
Volume 2002 (2002), Issue 9, Pages 908-925

Partitioning and Scheduling DSP Applications with Maximal Memory Access Hiding, Zhong Wang, Edwin Hsing-Mean Sha, and Yuke Wang
Volume 2002 (2002), Issue 9, Pages 926-935

P-CORDIC: A Precomputation Based Rotation CORDIC Algorithm, Martin Kuhlmann and Keshab K. Parhi
Volume 2002 (2002), Issue 9, Pages 936-943

Frequency Spectrum Based Low-Area Low-Power Parallel FIR Filter Design, Jin-Gyun Chung and Keshab K. Parhi
Volume 2002 (2002), Issue 9, Pages 944-953

Low-Complexity Versatile Finite Field Multiplier in Normal Basis, Hua Li and Chang Nian Zhang
Volume 2002 (2002), Issue 9, Pages 954-960

A Reduced-Complexity Fast Algorithm for Software Implementation of the IFFT/FFT in DMT Systems, Tsun-Shan Chan, Jen-Chih Kuo, and An-Yeu (Andy) Wu
Volume 2002 (2002), Issue 9, Pages 961-974

A DSP Based POD Implementation for High Speed Multimedia Communications, Chang Nian Zhang, Hua Li, Nuannuan Zhang, and Jiesheng Xie
Volume 2002 (2002), Issue 9, Pages 975-980

Wavelet Kernels on a DSP: A Comparison between Lifting and Filter Banks for Image Coding, Stefano Gnavi, Barbara Penna, Marco Grangetto, Enrico Magli, and Gabriella Olmo
Volume 2002 (2002), Issue 9, Pages 981-989

AVSynDEx: A Rapid Prototyping Process Dedicated to the Implementation of Digital Image Processing Applications on Multi-DSP and FPGA Architectures, Virginie Fresse, Olivier Déforges, and Jean-François Nezan
Volume 2002 (2002), Issue 9, Pages 990-1002


EURASIP Journal on Applied Signal Processing 2002:9, 877–878
© 2002 Hindawi Publishing Corporation

Editorial

Yuke Wang
Department of Computer Science, Box 830688, MS EC 31, University of Texas at Dallas, Richardson, TX 75083-0688, USA
Email: [email protected]

Yu Hen Hu
Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI 53706-1691, USA
Email: [email protected]

The telecommunications, wireless communications, multimedia, and consumer electronics industries are witnessing a rapid evolution toward integrating complete systems on a single chip. Single-chip systems will increasingly have both a hardware component and a software component, where the hardware component is heterogeneous in nature and may include a combination of ASICs, ASIPs, digital signal processors, reconfigurable processors, FPGAs, and general-purpose processors. The architecture of digital signal processors has taken many new directions, including VLIW, superscalar, SIMD, and more. The choice of architecture style and hardware/software combination is determined by trade-offs among cost, performance, power, time to market, and flexibility. Furthermore, the boundary between hardware and software has blurred, while system design is characterized by ever-increasing complexity that has to be implemented within reduced time and at minimum cost. Therefore, computer-aided design tools that facilitate an easy design process are of essential importance.

The first paper proposes lightweight floating-point arithmetic, a family of customizable floating-point data formats that bridges the design gap between software and hardware. The effectiveness of the proposed scheme is demonstrated using the inverse discrete cosine transform in the context of video coding. Such a flexible data format will find applications beyond multimedia in areas such as wireless communication, where a wide range of precision/power/speed/area trade-offs can be made.

The second paper considers negative cycle detection in a weighted directed graph in the context of high-level synthesis for DSP systems. The paper introduces the concept of adaptive negative cycle detection and demonstrates the application of the technique to problems such as performance analysis and design space exploration in DSP applications.

The third paper introduces a design environment, FRIDGE, which supports the transformation of signal processing algorithms coded in floating-point to a fixed-point representation. FRIDGE also provides a direct link to DSP implementation by processor-specific C-code generation.

The fourth paper presents a technique useful for efficient DSP processor compiler design, which reduces the CPU idle time due to long memory access latency. The technique explores the instruction-level parallelism among instructions of typical DSP applications.

The next three papers deal with the ASIC design of various important components such as the CORDIC algorithm, FIR filters, and multiplication in GF(2^n). The CORDIC algorithm has important applications in the Hartley transform, FFT, and DCT. The fifth paper introduces a novel CORDIC algorithm and a novel architecture resulting in the least delay. The sixth paper introduces an efficient parallel FIR filter with a new look-ahead quantization algorithm. The finite field GF(2^n) is of great interest for cryptosystems, and the seventh paper introduces a low-complexity pipeline multiplier for GF(2^n).

The next three papers discuss efficient implementations on DSP processors for applications in discrete multitone (DMT) communication systems, high-speed multimedia communication systems, and image coding. The 512-point IFFT/FFT is a modulation/demodulation kernel in ADSL systems, and an efficient fast algorithm, together with its DSP processor based implementation of the IFFT/FFT, is derived in the eighth paper. The ninth paper introduces an implementation of a point-of-deployment security module on a DSP processor (TMS320C6211). The tenth paper develops wavelet engines implemented on a DSP platform.

Finally, our last paper presents a full rapid prototyping process by means of existing academic and commercial CAD tools and platforms, targeting an architecture that combines multi-DSP with an FPGA.


Overall, we have covered several areas in this special issue: computer-aided design environments, frameworks, and tools to facilitate the design of complex communication and DSP systems; ASIC-based implementation of important components in communication and DSP systems; DSP processor based implementation; and integration of current tools. We thank the authors, reviewers, the publisher, the editorial committee, and the Editor-in-Chief for the tremendous amount of effort they put into this special issue to make it a success. We believe the readers will find the results presented in this special issue useful for their own design and implementation problems.

Yuke Wang
Yu Hen Hu

Yuke Wang received his B.S. degree from the University of Science and Technology of China, Hefei, China, in 1989, and the M.S. and Ph.D. degrees from the University of Saskatchewan, Canada, in 1992 and 1996, respectively. He has held faculty positions at Concordia University, Canada, and Florida Atlantic University, Florida, USA. Currently he is an Assistant Professor at the Computer Science Department, University of Texas at Dallas. He has also held visiting assistant professor positions at the University of Minnesota, the University of Maryland, and the University of California at Berkeley. Dr. Wang is currently an Editor of IEEE Transactions on Circuits and Systems, Part II, an Editor of IEEE Transactions on VLSI Systems, an Editor of the EURASIP Journal on Applied Signal Processing, and a few other journals. Dr. Wang's research interests include VLSI design of circuits and systems for DSP and communication, computer-aided design, and computer architectures. During 1996–2001, he published about 60 papers, among which about 20 are in IEEE/ACM Transactions.

Yu Hen Hu is a faculty member at the Department of Electrical and Computer Engineering, University of Wisconsin, Madison. He received the BSEE from National Taiwan University, and the MSEE and Ph.D. degrees from the University of Southern California. Prior to joining the University of Wisconsin, he was a faculty member in the Electrical Engineering Department of Southern Methodist University, Dallas, Texas. His research interests include multimedia signal processing, artificial neural networks, fast algorithms and design methodology for application-specific micro-architectures, as well as computer-aided design tools. He has published more than 180 technical papers in these areas. Dr. Hu is a fellow of the IEEE. He is a former Associate Editor (1988–1990) of the IEEE Transactions on Acoustics, Speech, and Signal Processing in the areas of system identification and fast algorithms. He served as the secretary of the IEEE Signal Processing Society (1996–1998) and a board member of the IEEE Neural Networks Council, and is currently a steering committee member of the International Conference on Multimedia and Expo on behalf of the IEEE Signal Processing Society.


EURASIP Journal on Applied Signal Processing 2002:9, 879–892
© 2002 Hindawi Publishing Corporation

Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform

Fang Fang
Department of Electrical and Computer Engineering, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
Email: [email protected]

Tsuhan Chen
Department of Electrical and Computer Engineering, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
Email: [email protected]

Rob A. Rutenbar
Department of Electrical and Computer Engineering, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
Email: [email protected]

Received 15 May 2001 and in revised form 9 May 2002

To enable floating-point (FP) signal processing applications in low-power mobile devices, we propose lightweight floating-point arithmetic. It offers a wider range of precision/power/speed/area trade-offs, but is wrapped in forms that hide the complexity of the underlying implementations from both multimedia software designers and hardware designers. Libraries implemented in C++ and Verilog provide flexible and robust floating-point units with variable bit-width formats, multiple rounding modes, and other features. This solution bridges the design gap between software and hardware, and accelerates the design cycle from algorithm to chip by avoiding the translation to fixed-point arithmetic. We demonstrate the effectiveness of the proposed scheme using the inverse discrete cosine transform (IDCT), in the context of video coding, as an example. Further, we implement the lightweight floating-point IDCT in hardware and demonstrate the power and area reduction.

Keywords and phrases: floating-point arithmetic, customizable bit-width, rounding modes, low-power, inverse discrete cosine transform, video coding.

1. INTRODUCTION

Multimedia processing has been finding more and more applications in mobile devices. A lot of effort must be spent to manage the complexity, power consumption, and time-to-market of modern multimedia system-on-chip (SoC) designs. However, multimedia algorithms are computationally intensive, rich in costly FP arithmetic operations rather than simple logic. FP arithmetic hardware offers a wide dynamic range and high computation precision, yet occupies large fractions of the total chip area and energy budget. Therefore, its application in mobile computing chips is highly limited. Many embedded microprocessors such as the StrongARM [1] do not include an FP unit due to its unacceptable hardware cost.

So there is an obvious gap in multimedia system development: software designers prototype these algorithms using high-precision FP operations to understand how the algorithm behaves, while the silicon designers ultimately implement these algorithms in integer-like hardware, that is, fixed-point units. This seemingly minor technical choice actually creates severe consequences: the need to use fixed-point operations often distorts the natural form of the algorithm, forces awkward design trade-offs, and even introduces perceptible artifacts. Error analysis and word length optimization of the fixed-point 2D IDCT (inverse discrete cosine transform) algorithm have been studied in [2], and a tool for translating FP algorithms to fixed-point algorithms was presented in [3]. However, such optimization and translation are based on human knowledge of the dynamic range, precision requirements, and the relationship between the algorithm's architecture and precision. This time-consuming and error-prone procedure often becomes the bottleneck of the entire system design flow.

In this paper, we propose an effective solution: lightweight FP arithmetic. This is essentially a family of customizable FP data formats that offer a wider range of precision/power/speed/area trade-offs, but wrapped in forms that hide the complexity of the underlying implementations from both multimedia algorithm designers and silicon designers. Libraries implemented in C++ and Verilog provide flexible and robust FP units with variable bit-width formats, multiple rounding modes, and other features. This solution bridges the design gap between software and hardware and accelerates the design cycle from algorithm to chip. Algorithm designers can translate FP arithmetic computations transparently to lightweight FP arithmetic and easily adjust the precision to what is needed. Silicon designers can use the standard ASIC or FPGA design flow to implement these algorithms using the arithmetic cores we provide, which consume less power than standard FP units. Manual translation from FP algorithms to fixed-point algorithms can be eliminated from the design cycle.

We test the effectiveness of our lightweight arithmetic library using an H.263 video decoder. Typical multimedia applications working with modest-resolution human sensory data such as audio and video do not need the whole dynamic range and precision that IEEE-standard FP offers. By reducing the complexity of FP arithmetic in many dimensions, such as narrowing the bit-width, simplifying the rounding methods and the exception handling, and even increasing the radix, we explore the impact of such lightweight arithmetic on both the algorithm performance and the hardware cost.

Our experiments show that for the H.263 video decoder, an FP representation with less than half of the IEEE standard FP bit-width can produce almost the same perceptual video quality. Specifically, only 5 exponent bits and 8 mantissa bits for a radix-2 FP representation, or 3 exponent bits and 11 mantissa bits for a radix-16 FP representation, are all we need to maintain the video quality. We also demonstrate that a simple rounding mode is sufficient for video decoding and offers an enormous reduction in hardware cost. In addition, we implement a core algorithm in the video codec, the IDCT, in hardware using the lightweight arithmetic unit. Compared to a conventional 32-bit FP IDCT, our approach reduces the power consumption by 89.5%.

The paper is organized as follows. Section 2 briefly introduces the relevant background on FP and fixed-point representations. Section 3 describes our C++ and Verilog libraries of lightweight FP arithmetic and the usage of the libraries. Section 4 explores the complexity reduction we can achieve for an IDCT built with our customizable library. Based on the results in this section, we present the implementation of lightweight FP arithmetic units and analyze the hardware cost reduction in Section 5. In Section 6, we compare the area/speed/power of a standard FP IDCT, a lightweight FP IDCT, and a fixed-point IDCT. Concluding remarks follow in Section 7.

2. BACKGROUND

2.1. Floating-point representation versus fixed-point representation

There are two common ways to specify real numbers: FP and fixed-point representations. FP can represent numbers on an exponential scale and is reputed for a wide dynamic range. The data format consists of three fields: sign, exponent, and fraction (also called mantissa), as shown in Figure 1.

[Figure 1: FP number representation. Fields: s (1 bit), exp (8 bits), frac (23 bits). FP value: (−1)^s · 2^(exp−bias) · 1.frac (the leading 1 is implicit), 0 ≤ exp ≤ 255, bias = 127.]

[Figure 2: Fixed-point number representation. Fields: int (16 bits), frac (16 bits).]

Dynamic range is determined by the exponent bit-width, and resolution is determined by the fraction bit-width. The widely adopted IEEE single FP standard [4] uses an 8-bit exponent that can reach a dynamic range roughly from 2^−126 to 2^127, and a 23-bit fraction that can provide a resolution of 2^(exp−127) · 2^−23, where exp stands for the value represented by the exponent field.

In contrast, the fixed-point representation is on a uniform scale, that is, essentially the same as the integer representation, except for the fixed radix point. For instance (see Figure 2), a 32-bit fixed-point number with a 16-bit integer part and a 16-bit fraction part can provide a dynamic range of 2^−16 to 2^16 and a resolution of 2^−16.

When prototyping algorithms with FP, programmers do not have to be concerned about dynamic range and precision, because IEEE standard FP provides more than necessary for most general applications. Hence, float and double are standard parts of programming languages like C++ and are supported by most compilers. However, in terms of hardware, the arithmetic operations of FP need to deal with the three parts (sign, exponent, fraction) individually, which adds substantially to the complexity of the hardware, especially in the aspect of power consumption, while fixed-point operations are almost as simple as integer operations. If the system has a stringent power budget, the application of FP units has to be limited, and on the other hand, a lot of manual work is spent in implementing and optimizing fixed-point algorithms to provide the necessary dynamic range and precision.
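As a concrete illustration of the three-field layout (our own sketch, not part of the original paper; the helper name decode_float is hypothetical), the following C++ fragment extracts the sign, exponent, and fraction fields of an IEEE single-precision number and reconstructs its value, covering both the normalized and the denormalized cases:

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Decode the 1/8/23-bit fields of an IEEE single-precision number and
// reconstruct its value as (-1)^s * 2^(exp-127) * 1.frac (normalized)
// or (-1)^s * 2^(-126) * 0.frac (denormalized, exp == 0).
void decode_float(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // reinterpret the 32 bits
    uint32_t s    = bits >> 31;            // 1 sign bit
    uint32_t exp  = (bits >> 23) & 0xFF;   // 8 exponent bits
    uint32_t frac = bits & 0x7FFFFF;       // 23 fraction bits
    double value = (exp != 0)
        ? std::ldexp(1.0 + frac / 8388608.0, (int)exp - 127)  // 8388608 = 2^23
        : std::ldexp(frac / 8388608.0, -126);
    if (s) value = -value;
    std::printf("s=%u exp=%u frac=0x%06X value=%g\n",
                (unsigned)s, (unsigned)exp, (unsigned)frac, value);
}

int main() {
    decode_float(0.15625f);   // s=0, exp=124, frac=0x200000
    decode_float(-2.5f);      // s=1, exp=128, frac=0x200000
}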

2.2. IEEE-754 floating-point standard

IEEE-754 is a standard for binary FP arithmetic [4]. Since our later discussion of lightweight FP is based on this standard, we give a brief review of its main features in this section.

Data format

The standard defines two primary formats: single precision (32 bits) and double precision (64 bits). The bit-widths of the three fields and the dynamic ranges of single- and double-precision FP are listed in Table 1.


Table 1: IEEE FP number format and dynamic range.

Format | Sign | Exp | Frac | Bias | Max | Min
Single | 1 | 8 | 23 | 127 | 3.4 · 10^38 | 1.4 · 10^−45
Double | 1 | 11 | 52 | 1023 | 1.8 · 10^308 | 4.9 · 10^−324

[Figure 3: Denormalized number format: s (1 bit), exponent = 00000000 (8 bits), frac ≠ 0 (23 bits).]

Rounding

The default rounding mode is round-to-nearest. If there is a tie between the two nearest neighbors, the result is rounded to the one whose least significant bit is zero. The three user-selectable rounding modes are round-toward +∞, round-toward −∞, and round-toward-zero (also called truncation).

Denormalization

Denormalization is a way to allow gradual underflow. For normalized numbers, because there is an implicit leading 1, the smallest positive value is 2^−126 · 1.0 for single precision (it is not 2^−127 · 1.0 because the exponent with all zeros is reserved for denormalized numbers). Values below this can be represented by a so-called denormalized format (Figure 3) that does not have the implicit leading 1.

The value of a denormalized number is 2^−126 · 0.frac, and the smallest representable value is hence scaled down to 2^−149 (2^−126 · 2^−23). Denormalization provides graceful degradation of precision for computations on very small numbers. However, it complicates the hardware significantly and slows down the more common normalized cases.

Exception handling

There are five types of exceptions defined in the standard: invalid operation, division by zero, overflow, underflow, and inexact. As shown in Figure 4, some bit patterns are reserved for these exceptions. When numbers simply cannot be represented, the format returns a pattern called NaN (not-a-number) with information about the problem. NaNs provide an escape mechanism to prevent system crashes in case of invalid operations. In addition to assigning specific NaN bit patterns, some status flags and trapping signals are used to indicate exceptions.

From the above review, we can see that IEEE standard FP arithmetic has a strong capability to represent real numbers as accurately as possible and is very robust when exceptions occur. However, if the FP arithmetic unit is dedicated to a particular application, the IEEE-mandated 32- or 64-bit bit-width may provide more precision and dynamic range than needed, and many other features may be unnecessary as well.

[Figure 4: Infinity and NaN representations. NaN: s | exp = 11111111 | frac ≠ 0. Infinity: s | exp = 11111111 | frac == 0. NaN is assigned when invalid operations occur, such as (+∞) + (−∞), 0 · ∞, 0/0, etc.]

3. CUSTOMIZABLE LIGHTWEIGHT FLOATING-POINT LIBRARY

The goal of our customizable lightweight FP library is to provide more flexibility than IEEE standard FP in bit-width, rounding, and exception handling. We created matched C++ and Verilog FP arithmetic libraries that can be used during algorithm/circuit simulation and circuit synthesis. With the C++ library, software designers can simulate the algorithms with lightweight arithmetic and decide the minimal bit-width, rounding mode, and so forth, according to the numerical performance. Then, with the Verilog library, hardware designers can plug the parameterized FP arithmetic cores into the system and synthesize it to a gate-level circuit. Our libraries provide a way to move the FP design choices (bit-width, rounding, . . .) upwards to the algorithm design stage and better predict the performance during early algorithm simulation.

3.1. Easy-to-use C++ class Cmufloat for algorithm designers

Our lightweight FP class is called Cmufloat and is implemented by overloading the existing C++ arithmetic operators (+, −, ∗, /, . . .). It allows direct operations, including assignment, between Cmufloat and any C++ data type except char. The bit-width of Cmufloat varies from 1 to 32, including sign, fraction, and exponent bits, and is specified during the variable declaration. Three rounding modes are supported: round-to-nearest, Jamming, and truncation, one of which is chosen by defining a symbol in an appropriate configuration file. An explanation of our rounding modes is presented in detail later. In Figure 5, we summarize the operators of Cmufloat and give some examples of its use.

Our implementation of lightweight FP offers two advantages. First, it provides a transparent mechanism for embedding Cmufloat numbers in programs. As shown in the example, designers can use Cmufloat as a standard C++ data type. Therefore, the overall structure of the source code can be preserved, and a minimal amount of work is spent in translating a standard FP program to a lightweight FP program. Second, the arithmetic operators are implemented by bit-level manipulation, which carefully emulates the hardware implementation. We believe the correspondence between software and hardware is more exact than in previous work [5, 6]. These other approaches appear to have implemented the operators by simply quantizing the result of standard FP operations into limited bits.


(a) Operators with Cmufloat: the arithmetic operators (+, −, ∗, /), assignment (=), and the comparisons (==, !=, >=, >, <=, <) are overloaded between Cmufloat and each of the types Cmufloat, double, float, int, and short.

(b) Examples of Cmufloat:

Cmufloat<14, 5> a = 0.5;  // 14-bit fraction and 5-bit exponent
Cmufloat<> b = 1.5;       // Default is IEEE-standard float
Cmufloat<18, 6> c[2];     // Define an array
float fa;

c[0] = a + b;
fa = a * b;               // Assign the result to float
c[1] = fa + c[0];         // Operation between float and Cmufloat
cout << c[1];             // I/O stream
func(a);                  // Function call

Figure 5: Operators and examples of Cmufloat.

This approach actually retains more bit-width for the intermediate operations, while our approach guarantees that the results of all operations, including the intermediate results, are consistent with the hardware implementation. Hence, the numerical performance of the system during early algorithm simulation is more trustworthy.
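To make the bit-level emulation idea concrete, here is a hypothetical sketch (our illustration, not the actual Cmufloat source; the class name LwFloat and its members are invented) of how an overloaded operator can re-quantize every result, including intermediate ones, to a given fraction width by truncation:

#include <cmath>

// Hypothetical model of the emulation idea: every result is immediately
// re-quantized to FRAC fraction bits (truncation), so the simulation never
// carries more precision than the modeled hardware.
template <int FRAC, int EXP>
struct LwFloat {
    float v;                                       // always kept quantized
    LwFloat(float x = 0.0f) : v(quantize(x)) {}

    static float quantize(float x) {
        if (x == 0.0f) return 0.0f;
        int e;
        float m = std::frexp(x, &e);               // x = m * 2^e, 0.5 <= |m| < 1
        // Keep FRAC+1 significant mantissa bits, truncating the rest.
        float kept = std::trunc(m * (1 << (FRAC + 1)));
        return std::ldexp(kept / (1 << (FRAC + 1)), e);
        // A full model would also clamp e to the EXP-bit dynamic range.
    }
    friend LwFloat operator+(LwFloat a, LwFloat b) { return LwFloat(a.v + b.v); }
    friend LwFloat operator*(LwFloat a, LwFloat b) { return LwFloat(a.v * b.v); }
};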

3.2. Parameterized Verilog library for silicon designers

We provide a rich set of lightweight FP arithmetic units (adders, multipliers) in the form of parameterized Verilog. First, designers can choose implementations according to the rounding mode and the exception handling. Then they can specify the bit-widths of the fraction and exponent by parameters. With this library, silicon designers are able to simulate the circuit at the behavioral level and synthesize it into a gate-level netlist. The availability of such cores makes possible a wider set of design trade-offs (power, speed, area, accuracy) for multimedia tasks.

4. REDUCING THE COMPLEXITY OF FLOATING-POINT ARITHMETIC FOR A VIDEO CODEC IN MULTIPLE DIMENSIONS

Most multimedia applications process modest-resolution human sensory data, which allows hardware implementations to use low-precision arithmetic computations. Our work aims to find out how much the precision and the dynamic range can be reduced from the IEEE standard FP without perceptual quality degradation. In addition, the impacts of other features of the IEEE standard, such as the rounding mode, denormalization, and the radix choice, are also studied. Specifically, we target the IDCT algorithm in an H.263 video codec, since it is the only module that really uses FP computations in the codec, and it is also common in many other media applications, such as image processing, audio compression, and so forth.

In Figure 6, we give a simplified diagram of a video codec. On the decoder side, the input compressed video is put into the IDCT after inverse quantization. After some FP computations in the IDCT, the DCT coefficients are converted to pixel values that are the differences between the previous frame and the current frame. The last step is to obtain the current frame by adding up the outputs of the IDCT and the previous frame after motion compensation.

Considering the IEEE representation of FP numbers, there are five dimensions that we can explore in order to reduce the hardware complexity (Table 2). An accuracy versus hardware cost trade-off is made in each dimension. In order to measure the accuracy quantitatively, we integrate the Cmufloat IDCT into a complete video codec and measure the PSNR (peak signal-to-noise ratio) of the decoded video, which reflects the decoded video quality,

$$\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\sum_{i=0}^{N} \left(p_i - f_i\right)^2 / N}, \qquad (1)$$

where N is the total number of pixels, p_i stands for the pixel value decoded by the lightweight FP algorithm, and f_i stands for the reference pixel value of the original video.
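For reference, a direct transcription of (1) into C++ (our own helper, not from the paper) looks as follows:

#include <cmath>
#include <cstddef>

// PSNR in dB between decoded pixels p[] and reference pixels f[],
// following (1): PSNR = 10 * log10(255^2 / MSE).
double psnr(const unsigned char* p, const unsigned char* f, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double d = double(p[i]) - double(f[i]);
        sum += d * d;                  // accumulate squared pixel error
    }
    return 10.0 * std::log10(255.0 * 255.0 / (sum / n));
}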

4.1. Reducing the exponent and fraction bit-width

Reducing the exponent bit-width

The exponent bit-width determines the dynamic range. Using a 5-bit exponent as an example, we derive the dynamic range in Table 3. Complying with the IEEE standard, the exponent with all 1s is reserved for infinity and NaN. With a bias of 15, the dynamic range for a 5-bit exponent is from 2^−14 or 2^−15 to 2^16, depending on the support for denormalization.

In order to decide the necessary exponent bit-width for our IDCT, we collected the histogram of exponents for all the variables in the IDCT algorithm during the decoding of a video sequence (see Figure 7). The range of these exponents lies in [−22, 10], which is consistent with the theoretical result in [7]. From the dynamic range analysis above, we know that a 5-bit exponent can almost cover such a range, except when numbers are extremely small, while a 6-bit exponent can cover the entire range. However, our experiment shows that a 5-bit exponent is able to produce the same PSNR as an 8-bit exponent.
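Such a histogram is easy to gather with a few lines of instrumentation; a possible sketch (our own code, using std::frexp) is:

#include <cmath>
#include <map>

// Histogram of base-2 exponents of intermediate IDCT values. frexp
// returns x = m * 2^e with 0.5 <= |m| < 1, so the exponent of the
// normalized form 1.f * 2^(e-1) is e - 1.
std::map<int, long> exp_histogram;

void record(double x) {
    if (x == 0.0) return;              // zero carries no exponent
    int e;
    std::frexp(x, &e);
    ++exp_histogram[e - 1];
}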

Reducing the fraction bit-width

Reducing the fraction bit-width is the most practical way to lower the hardware cost because the complexity of an integer multiplier is reduced quadratically with decreasing bit-width [8]. On the other hand, the accuracy is degraded when narrowing the bit-width. The influence of decreasing bit-width on video quality is shown in Figure 8.

As we can see in the curve, the PSNR remains almost constant across a rather wide range of fraction bit-widths, which means that the fraction width does not affect the decoded video quality in this range.


[Figure 6: Video codec diagram. Encoder: DCT → Q → transmit, with an IQ → IDCT → motion compensation loop; Decoder: IQ → IDCT (Cmufloat algorithm) → motion compensation. Q: quantizer, IQ: inverse quantizer, D: delay.]

Table 2: Working dimensions for lightweight FP.

Dimension | Description
Smaller exponent bit-width | Reduce the number of exponent bits at the expense of dynamic range
Smaller fraction bit-width | Reduce the number of fraction bits at the expense of precision
Simpler rounding mode | Choose a simpler rounding mode at the expense of precision
No support for denormalization | Do not support denormalization at the expense of precision
Higher radix | Increase the implied radix from 2 to 16 for the FP exponent (higher-radix FP needs fewer exponent bits and more fraction bits than radix-2 FP in order to achieve comparable dynamic range and precision)

Table 3: Dynamic range of a 5-bit exponent.

Exponent | Bit pattern | Value of exp − bias | Dynamic range
Biggest exponent | 11110 | 16 | 2^16
Smallest exponent (with support for denormalization) | 00001 | −14 | 2^−14
Smallest exponent (no support for denormalization) | 00000 | −15 | 2^−15

The cutoff point where the PSNR starts dropping is as small as 8 bits, which is about 1/3 of the fraction width of an IEEE standard FP. The difference in PSNR between an 8-bit fraction FP and a 23-bit fraction is only 0.22 dB, which is almost imperceptible to human eyes. One frame of the video sequence is compared in Figure 9. The top one is decoded by a full-precision FP IDCT, and the bottom one is decoded by a 14-bit FP IDCT. From the perceptual quality point of view, it is hard to tell the difference between the two.

From the above analysis, we can reduce the total bit-width from 32 bits to 14 bits (1-bit sign + 5-bit exponent + 8-bit fraction) while preserving good perceptual quality.

[Figure 7: Histogram of exponent values (x-axis: exponent, from −26 to 10; y-axis: count on a logarithmic scale from 1 to 10^6).]

In order to generalize our result, the same experiment was carried out on three other video sequences (Akiyo, Stefan, Mobile), all of which show around 0.2 dB degradation in PSNR when 14-bit FP is applied.

The relationship between video compression ratio and the minimal bit-width

For streaming video, the lower the bit rate, the worse the video quality. The difference in bit rate is mainly caused by the quantization step size during encoding. A larger step size means coarser quantization and therefore worse video quality. Considering the limited wireless network bandwidth and the low display quality of mobile devices, a relatively low bit rate is preferred for transferring video in this situation.


[Figure 8: PSNR (dB) versus fraction width; the curve is flat from 23 bits down to the cutoff point at 8 bits, below which the PSNR drops.]

[Figure 9: Video quality comparison. (a) One frame decoded by 32-bit FP (1-bit sign + 8-bit exponent + 23-bit fraction). (b) The same frame decoded by 14-bit FP (1-bit sign + 5-bit exponent + 8-bit fraction).]

Hence, we want to study the relationship between the compression ratio, or quantization step size, and the minimal bit-width.

We compare the PSNR curves obtained by experiments with different quantization step sizes in Figure 10. From the figure, we can see that for a larger quantization step size, the PSNR is lower, but at the same time the curve drops more slowly, and the minimal bit-width can be reduced further. This is because coarse quantization can hide more computation error under the quantization noise. Therefore, less computational precision is needed for a video codec using a larger quantization step size.

[Figure 10: PSNR versus fraction width for quantization step sizes 4, 8, and 16.]

[Figure 11: PSNR (dB) versus fraction width for intracoding and intercoding.]

The relationship between inter/intra-coding and the minimal bit-width

Intercoding refers to the coding of each video frame with reference to the previous video frame. That is, only the difference between the current frame and the previous frame is coded, after motion compensation. Since in most videos adjacent frames are highly correlated, intercoding provides very high efficiency.

Based on the assumption that video frames have some correlation, for intercoding, the differences coded as the inputs to the DCT are typically much smaller than regular pixel values. Accordingly, the DCT coefficients, or inputs to the IDCT, are also smaller than those in intracoding. One property of FP numbers is that the representation error is smaller when the number to be represented is smaller. As Figure 11 shows, the PSNR of intercoding drops more slowly than that of intracoding, so the minimum bit-width for intercoding can be 1 bit less than for intracoding.

However, if the encoder uses the full-precision FP IDCT while the decoder uses the lightweight FP IDCT, then the error propagation effect of intercoding cannot be eliminated. In that case, intercoding does not have the above advantage.

The results in this section demonstrate that some programs do not need the extreme precision and dynamic range provided by IEEE standard FP. Applications dealing with modest-resolution human sensory data can tolerate some computation error in the intermediate or even final results, while giving similar human perceptual results. Experiments on other applications also agree with this assertion. We also applied Cmufloat to an MP3 decoder. The bit-width can be cut down to 14 bits (1-bit sign + 6-bit exponent + 7-bit fraction) and the noise behind the music is still not perceptible. In the CMU Sphinx application, a speech recognizer, 11-bit FP (1-bit sign + 6-bit exponent + 4-bit fraction) can maintain the same recognition accuracy as 32-bit FP. Such dramatic bit-width reduction offers an enormous advantage that may broaden the application of lightweight FP units in mobile devices.

Finally, we note that the numerical analysis and precision optimization in this section can be implemented in a semiautomated way with appropriate compiler support. We can extend an existing C++ compiler to handle lightweight arithmetic operations and assist the process of exploring the precision trade-offs with less programmer intervention. This will unburden designers from manually translating code into proper limited-precision formats.

4.2. Rounding modes

When an FP number cannot be represented exactly, or the intermediate result is beyond the allowed bit-width during computation, the number is rounded, introducing an error less than the value of the least significant bit. Among the four rounding modes specified by the IEEE FP standard, round-to-nearest, round-to-(+∞), and round-to-(−∞) need an extra adder in the critical path, while round-toward-zero is the simplest in hardware but the least accurate. Since round-to-nearest has the highest accuracy, we implement it as our baseline of comparison. There is another classical alternative mode that may have potential in both accuracy and hardware cost: von Neumann rounding, also known as Jamming [9]. We will discuss these three rounding modes (round-to-nearest, Jamming, round-toward-zero) in detail.

Round-to-nearest

In the standard FP arithmetic implementation, there are three bits beyond the significant bits that are kept for intermediate results [10] (see Figure 12). The sticky bit is the logical OR of all bits thereafter.

These three bits participate in rounding in the following way (b is the least significant bit of the result, followed by the guard, round, and sticky bits):

b000 to b011: truncate the tail bits.
b100: if b is 1, add 1 to b; if b is 0, truncate the tail bits.
b101 to b111: add 1 to b.

[Figure 12: The guard, round, and sticky bits follow the least significant bit b of the significand.]

[Figure 13: Jamming rounding: the new least significant bit b′ is the OR of b and the three tail bits; the tail bits are then truncated.]

Round-to-nearest is the most accurate rounding mode, but it needs some comparison logic and a carry-propagate adder in hardware. Further, since the rounding can actually increase the fraction magnitude, it may require extra normalization steps, which cause additional fraction and exponent calculations.

Jamming

The rule for Jamming rounding is as follows: if b is 1, truncate the 3 tail bits; if b is 0 and there is a 1 among the 3 tail bits, add 1 to b; otherwise, if b and the 3 tail bits are all 0, truncate the tail bits. Essentially, it is the function of an OR gate (see Figure 13).

Jamming is extremely simple as hardware, almost as simple as truncation, but numerically more attractive for one subtle but important reason. The rounding created by truncation is biased; the rounded result is always smaller than the correct value. Jamming, by sometimes forcing a 1 into the least significant bit position, is unbiased. The magnitude of Jamming errors is no different from truncation, but the mean of the errors is zero. This important distinction was recognized by von Neumann almost 50 years ago [9].

Round-toward-zero

The operation of round-toward-zero is just truncation. This mode has no overhead in hardware, and it does not have to keep 3 more bits for the intermediate results. So it is much simpler in hardware than the first two modes.
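The three modes can be expressed compactly on an integer significand that carries the guard/round/sticky bits in its three low-order positions; the following sketch (our illustration, not the library code) shows the decision logic:

#include <cstdint>

enum class Mode { Nearest, Jamming, Truncate };

// 'sig' is the significand with 3 extra low bits (guard/round/sticky).
// Returns the rounded significand with those 3 bits removed. (In hardware,
// a carry out of the top bit would trigger renormalization.)
uint32_t round_sig(uint32_t sig, Mode mode) {
    uint32_t kept = sig >> 3;          // significand without the tail
    uint32_t tail = sig & 0x7;         // guard/round/sticky bits
    switch (mode) {
    case Mode::Truncate:               // round-toward-zero: drop the tail
        return kept;
    case Mode::Jamming:                // OR the tail into the LSB (von Neumann)
        return kept | (tail != 0);
    case Mode::Nearest:                // round-to-nearest, ties to even LSB
        if (tail > 4)  return kept + 1;            // b101..b111: add 1 to b
        if (tail == 4) return kept + (kept & 1);   // b100: add 1 only if b is 1
        return kept;                               // b000..b011: truncate
    }
    return kept;
}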

The PSNR curves for the same video sequence obtained using these three rounding modes are shown in Figure 14. The three rounding modes produce almost the same PSNR when the fraction bit-width is more than 8 bits. At the point of 8 bits, the PSNR of truncation is about 0.2 dB worse than the other two.


[Figure 14: PSNR (dB) versus fraction width for round-to-nearest, truncation, and Jamming.]

[Figure 15: Denormalization (3-bit fraction example): without denormalization the smallest fraction is 1.000; with denormalization it is 0.001.]

On the other hand, from the hardware point of view, Jamming is much simpler than round-to-nearest, and truncation is the simplest among these three modes. So a trade-off between quality and complexity must be made between Jamming and truncation. We will finalize the choice of rounding mode in the hardware implementation section.

4.3. Denormalization

The IEEE standard allows for a special set of non-normalized numbers that represent magnitudes very close to zero. We illustrate this in Figure 15 with an example of a 3-bit fraction. Without denormalization, there is an implicit 1 before the fraction, so the actual smallest fraction is 1.000, while with denormalization, the leading 1 is not enforced, so the smallest fraction is scaled down to 0.001. This mechanism provides more precision for scientific computation with small numbers, but for multimedia applications, especially for a video codec, do those small numbers during the computation affect the video quality?

We experimented on the IDCT with a 5-bit exponent Cmufloat representation; 5 bits was chosen to ensure that no overflow would happen during the computation. But from the histogram of Figure 7, there are still some numbers below the threshold of normalized numbers. That means that if denormalization is not supported, these numbers will be rounded to zero. However, the experiment shows that the PSNRs with and without denormalization are the same, which means that denormalization does not affect the decoded video quality at all.

4.4. Higher radix for FP exponent

The exponent of the IEEE standard FP is based on radix 2. Historically, there are also systems based on radix 16, for example, the IBM 390 [11]. The advantage of radix 16 lies mainly in fewer types of shifting during prealignment and normalization, which can reduce the shifter complexity in the FP adder and multiplier. We will discuss this issue in Section 5.4 when we discuss the hardware implementation in more detail.

The potential advantage of a higher radix such as 16 is that a smaller exponent bit-width is needed for the same dynamic range as radix-2 FP, while the disadvantage is that a larger fraction bit-width has to be chosen to maintain comparable precision. We analyze these features in the following.

Exponent bit-width

The dynamic range represented by an i-bit exponent is approximately from $\beta^{-2^{i-1}}$ to $\beta^{2^{i-1}}$ (β is the radix). Assume we use an i-bit exponent for radix-2 FP and a j-bit exponent for radix-16 FP. If they have the same dynamic range, then $2^{2^{i-1}} = 16^{2^{j-1}}$, or j = i − 2. Specifically, if the exponent bit-width for radix-2 is 5, then only 3 bits are needed for the radix-16 FP to reach approximately the same range.
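Writing 16 = 2^4 makes the equivalence explicit:

$$16^{2^{j-1}} = \left(2^{4}\right)^{2^{j-1}} = 2^{4 \cdot 2^{j-1}} = 2^{2^{j+1}},
\qquad \text{so } 2^{2^{i-1}} = 2^{2^{j+1}} \;\Longrightarrow\; i - 1 = j + 1 \;\Longrightarrow\; j = i - 2.$$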

Fraction bit-width

The precision of an FP number is mainly determined by the fraction bit-width. But the radix of the exponent also plays a role, due to the way that normalization works. Normalization ensures that no number can be represented by two or more bit patterns in the FP format, thus maximizing the use of the finite number of bit patterns. Radix-2 numbers are normalized by shifting to ensure a leading 1 in the most significant fraction bit. The IEEE format actually makes this implicit, that is, it is not physically stored in the number.

For radix 16, however, normalization means that the first digit of the fraction, that is, the most significant 4 bits after the radix point, is never 0000. Hence there are four bit patterns that can appear in the radix-16 fraction (see Table 4). In other words, the radix-16 fraction uses its available bits in a less efficient way, because the leading zeros reduce the number of significant bits of precision. We analyze the loss of significant precision bits in Table 4.

The significant bit-width of a radix-2 i-bit fraction is i + 1, while for a radix-16 j-bit fraction, the significant bit-width is j, j − 1, j − 2, or j − 3, each with probability 1/4. The minimum fraction bit-width of a radix-16 FP that can guarantee precision no less than radix-2 must satisfy the following inequality:

min{ j, j − 1, j − 2, j − 3} ≥ i + 1, (2)

so the minimum fraction bit-width is i + 4. Actually, it can provide more precision than radix-2 FP, since j, j − 1, and j − 2 are larger than i + 1.

From the previous discussions, we know that 14-bit radix-2 FP (1-bit sign + 5-bit exponent + 8-bit fraction) can produce good video quality. Moving to radix-16, two fewer exponent bits (3 bits) and four more fraction bits (12 bits), or 16 bits in total, can guarantee comparable video quality.
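As a quick sanity check (ours, not in the original text), the two chosen formats do cover the same magnitude range:

$$\text{radix-2, 5-bit exponent: } 2^{\pm 2^{4}} = 2^{\pm 16}; \qquad
\text{radix-16, 3-bit exponent: } 16^{\pm 2^{2}} = 2^{\pm 16}.$$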


Table 4: Comparison of radix-2 to radix-16.

FP format | Normal form of fraction | Range of fraction | Significant bits
Radix-2 (i-bit fraction) | 1.xx···xx | 1 ≤ f < 2 | i + 1
Radix-16 (j-bit fraction) | .1xx·····x | 1/2 ≤ f < 1 | j
Radix-16 (j-bit fraction) | .01xx···x | 1/4 ≤ f < 1/2 | j − 1
Radix-16 (j-bit fraction) | .001x···x | 1/8 ≤ f < 1/4 | j − 2
Radix-16 (j-bit fraction) | .0001x··x | 1/16 ≤ f < 1/8 | j − 3

[Figure 16: PSNR (dB) versus fraction width for radix-2 and radix-16; the radix-16 cutoff point is at 11 bits.]

Table 5: An 11-bit fraction for radix-16 is sufficient.

FP format | PSNR (dB)
Radix-2 (8-bit fraction) | 38.529
Radix-16 (12-bit fraction) | 38.667
Radix-16 (11-bit fraction) | 38.536
Radix-16 (10-bit fraction) | 38.007

Since the minimum fraction bit-width is derived from a worst-case analysis, it could be reduced further from the perspective of average precision. Applying radix-16 Cmufloat to the video decoder, we can see that the cutoff point is actually 11 bits, not 12 bits, of fraction width (Figure 16 and Table 5).

After discussing each working dimension, we summarize the lightweight FP design choices for the H.263 video codec as follows:

Data format: 14-bit radix-2 FP (5-bit exponent + 8-bit fraction) or 15-bit radix-16 FP (3-bit exponent + 11-bit fraction).
Rounding: Jamming or truncation.
Denormalization: not supported.

The final choice of data format and rounding mode is made in Section 5 according to the hardware cost.

In all discussions in this section, we use PSNR as a measurement of the algorithm. However, we need to mention that there is an IEEE standard specifying the precision requirement for 8 × 8 DCT implementations [12]. The standard has the following specification:

$$\mathrm{omse} = \frac{\sum_{i=0}^{7}\sum_{j=0}^{7}\sum_{k=0}^{10000} e_k^2(i, j)}{64 \times 10000} \le 0.02 \qquad (3)$$

(where omse is the overall mean square error, e_k is the pixel difference between the reference and the proposed IDCT, and (i, j) is the position of the pixel in the 8 × 8 block).

A PSNR specification can be derived from omse: PSNR = 10 log_10(255^2 / omse) ≥ 65.1 dB, which is too tight for videos/images displayed by mobile devices. The other reason we did not choose this standard is that it uses uniformly distributed random numbers as input pixels, which eliminates the correlation between pixels and enlarges the FP computation error. We also did experiments based on the IEEE standard. It turns out that around a 17-bit fraction is required to meet all the constraints. From the PSNR curves in this section, we know that the PSNR stays almost constant for fraction widths from 17 bits down to 9 bits. The experimental results support our claim very well that the IEEE standard specifications for the IDCT are too strict for encoding/decoding real video sequences.

5. HARDWARE IMPLEMENTATION OF LIGHTWEIGHT FP ARITHMETIC UNITS

FP addition and multiplication are the most frequent FP operations. A lot of work has been published on IEEE-compliant FP adders and multipliers, focusing on reducing the latency of the computation. In the IBM RISC System/6000, a leading-zero anticipator was introduced in the FP adder [13]. The SNAP project proposed a two-path approach in the FP adder [14]. However, the benefit of these algorithms is not significant, and the penalty in area is not negligible, when the bit-width is very small. In this section, we present the structure of the FP adder and multiplier appropriate for narrow bit-widths and study the impact of different rounding/exception handling/radix schemes.

Our design is based on the Synopsys DesignWare library and the STMicroelectronics 0.18 µm technology library. The area and latency are measured on the gate-level circuit by Synopsys Design Compiler, and the power consumption is measured by the Cadence Verilog-XL simulator and Synopsys DesignPower.


[Figure 17: Diagrams of the FP adder and multiplier. (a) Adder: prealignment and exception detection → fraction addition → normalization → rounding → exception handling. (b) Multiplier: fraction multiplication and exponent addition → normalization → rounding → exception handling.]

5.1. Structure

As shown in Figure 17, we take the most straightforward top-level structure for the FP adder and multiplier. Latency-reducing tricks are not adopted because, firstly, the adder and multiplier can easily be accelerated by pipelining, and secondly, those tricks increase the area by a large percentage in the case of narrow bit-widths.

Shifter

The core component in prealignment and normalization is a shifter. There are three common shifter architectures:

The N-to-1 Mux shifter is appropriate when N is small.
The logarithmic shifter [15] uses log(N) stages, each handling a single power-of-2 shift (modeled in the sketch below). This architecture has a compact area, but the timing path is long when N is big.
The fast two-stage shifter is used in the IBM RISC System/6000 [16]. The first stage shifts by (0, 4, 8, 12, . . .) bit positions, and the second stage shifts by (0, 1, 2, 3) bit positions.
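A small C++ model of the logarithmic shifter (a software sketch of the hardware structure, our own code) makes the log(N)-stage idea concrete: each stage is a row of 2-to-1 multiplexers that either passes the word through or shifts it by a fixed power of 2.

#include <cstdint>

// Software model of a logarithmic left shifter for a 32-bit word:
// five stages, each controlled by one bit of the shift amount and
// shifting by a fixed power of two (16, 8, 4, 2, 1).
uint32_t log_shift_left(uint32_t x, unsigned amount) {
    for (int stage = 4; stage >= 0; --stage) {
        unsigned step = 1u << stage;    // 16, 8, 4, 2, 1
        if (amount & step)              // this stage's control bit
            x <<= step;                 // mux selects the shifted path
    }
    return x;
}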

The comparison of these three architectures over different bit-widths is shown in Figure 18. It indicates that for narrow bit-widths (8 ∼ 16), the logarithmic shifter is the best choice considering both area and delay, but when the bit-width is increased to a certain level, the two-stage shifter becomes the best.

[Figure 18: Area (µm²) and delay (ns) comparisons of the three shifter structures (N-to-1 Mux, two-stage, logarithmic) at bit-widths 8, 16, and 32.]


5.2. Rounding

In our lightweight FP arithmetic operations, round-to-nearest is not considered because of its heavy hardware overhead. Jamming rounding demonstrates performance similar to round-to-nearest in the video codec example, but it still has to keep three more bits in each stage of the FP adder, which becomes significant, especially for narrow bit-widths. The other candidate is round-toward-zero, because its performance is close to Jamming rounding at the cutoff point. Table 6 shows the reduction in both area and delay when changing the rounding mode from Jamming to round-toward-zero. Since a 15% reduction in the area of the FP adder can be obtained, we finally choose truncation as the rounding mode.

Page 19: Implementation of DSP and Communication Systemsdownloads.hindawi.com/journals/specialissues/434718.pdf · ing society (1996–1998), a board member at IEEE neural network council,

Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform 889

Table 6: Comparison of rounding modes. (The sublabels are given here consistently with Tables 7 and 8, where the 8401 µm²/10.21 ns unit is the adder and the 7893 µm²/5.8 ns unit is the multiplier.)

(a) 14-bit FP adder.
Rounding mode | Area (µm²) | Delay (ns)
Jamming | 8401 | 10.21
Truncation | 7123 (−15%) | 9.43 (−7.6%)

(b) 14-bit FP multiplier.
Rounding mode | Area (µm²) | Delay (ns)
Jamming | 7893 | 5.8
Truncation | 7741 (−1.9%) | 5.71 (−1.6%)

Table 7: Comparison of exception handling.

(a) 14-bit FP adder.
Exception handling | Area (µm²) | Delay (ns)
Full | 8401 | 10.21
Partial | 7545 (−10%) | 9.26 (−9.3%)

(b) 14-bit FP multiplier.
Exception handling | Area (µm²) | Delay (ns)
Full | 7893 | 5.8
Partial | 7508 (−4.9%) | 4.62 (−20%)

[Figure 19: Shifting positions for radix-2 and radix-16 FP. (a) Radix-2 FP (8-bit fraction + 1 leading bit). (b) Radix-16 FP (11-bit fraction).]

5.3. Exception handling

For an IEEE-compliant FP arithmetic unit, a large portion of the hardware in the critical timing path is dedicated to rare exceptional cases, for example, overflow, underflow, infinity, NaN, and so forth. If the exponent bit-width is large enough to avoid overflow, then infinity and NaN will not occur during computation. Then, in the FP adder and multiplier diagrams (Figure 17), exception detection is not needed, and only underflow is detected in exception handling. As a result, the delays of the FP adder and multiplier are reduced by 9.3% and 20%, respectively, using partial exception handling.

5.4. Radix

The advantage of higher-radix FP is a less complex shifter. From Section 4, we know that in the video codec, the precision of radix-16 FP with an 11-bit fraction is close to that of radix-2 FP with an 8-bit fraction.

Table 8: Comparison of radix 2 and radix 16.

(a) FP adder.

Radix    Area (µm²)      Delay (ns)
2        8401            10.21
16       7389 (−12%)     8.48 (−17%)

(b) FP multiplier.

Radix    Area (µm²)      Delay (ns)
2        7893            5.8
16       11284 (+43%)    6.62 (+14%)

Table 9

(a) Comparison of lightweight FP and IEEE FP adder.

Data format      Area (µm²)    Delay (ns)    Power (mW)
Lightweight      10666         5.67          16.5
IEEE standard    51830         14.5          100.1

(b) Comparison of lightweight FP and IEEE FP multiplier.

Data format      Area (µm²)    Delay (ns)    Power (mW)
Lightweight      5206          7.49          6.9
IEEE standard    19943         21.7          29.3

We illustrate the difference in the shifter in Figure 19. The step size of shifting for radix-16 is four bits, and only three shifting positions are needed. Such a simple shifter can be implemented as a 3-to-1 Mux. Although there are 3 more bits in the fraction, the FP adder still benefits from the higher radix in both area and delay (Table 8).

On the other hand, a wider fraction increases the complexity of the FP multiplier. The size of a multiplier increases roughly quadratically with the bit-width, so only 3 more fraction bits increase the multiplier's area by 43% (Table 8).

From the table, it is clear that radix-16 is not always better than radix-2. In a given application, if there are more adders than multipliers, then radix-16 is a better choice than radix-2. In our IDCT structure, there are 29 adders and 11 multipliers; therefore, radix-16 is chosen for the implementation of the IDCT.
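The reason the radix-16 shifter collapses to a 3-to-1 mux can be seen in a few lines of Python (our model; the field widths follow the 11-bit-fraction format discussed above): exponents count radix-16 digits, so every shift is a multiple of 4 bits, and with 11 fraction bits only the amounts 0, 4, and 8 matter.

    def radix16_prealign(frac_small, exp_diff, frac_bits=11):
        """Prealignment in a radix-16 FP adder: the exponent difference
        counts radix-16 digits, so the fraction moves 4 bits per unit.
        With an 11-bit fraction only shifts of 0, 4, or 8 are possible,
        i.e., a 3-to-1 mux in hardware."""
        shift = 4 * exp_diff
        if shift >= frac_bits:        # smaller operand is shifted out
            return 0
        return frac_small >> shift    # one of three positions: 0, 4, 8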

Combining all the optimization strategies in this section, the final 15-bit radix-16 FP adder and multiplier are compared with the IEEE standard compliant single-precision FP adder and multiplier in Table 9. By reducing the bit-width and simplifying the rounding, shifting, and exception handling, the power consumption of the FP arithmetic units is cut down to around 1/5 of that of the IEEE standard FP units.

Further optimization can be conducted in two directions. One is a low-power design approach: as proposed in [17], a triple-datapath FP adder structure can reduce the power-delay product by 16X. The other is transistor-level optimization: shifters and multiplexors designed at the transistor level are much smaller and faster than those implemented as logic gates [15, 18].


Figure 20: Structure of IDCT. (A butterfly network mapping inputs x0–x7 to outputs y0–y7 through multiplier constants C4, C6/C4, C2/C4, 1/C4, C1, C3, C5, C7, where C_i = 1/(2 cos(iπ/16)); each butterfly block takes inputs x and y and produces y + x and y − x, and a multiplier block produces C ∗ x from x.)

6. IMPLEMENTATION OF IDCT

The IDCT performs a linear transform on an 8-input data set. The algorithm developed in [19] uses minimal resources, 11 multiplications and 29 additions, to compute a one-dimensional IDCT (Figure 20).

6.1. Optimization in butterfly

In the architecture of the IDCT, there are 12 butterflies, and each one is composed of two adders. Because these two adders have the same inputs, some operations such as prealignment and negation can be shared. Further, in the butterfly structure, normalization and leading-one detection can be separated into two paths and executed in parallel, which reduces the timing-critical path.

Figure 21a is a diagram of the FP adder, and Figure 21b is a diagram of the butterfly, which has the function of two FP adders. From the figure, we can see that a butterfly is similar in structure to one FP adder, except for one extra integer adder and some selection logic. Table 10 shows that such a butterfly structure saves 37% area ((10412 − 6545)/10412 ≈ 37%) compared with two separate adders.

This reduction is due to the properties of FP addition; for a fixed-point butterfly, no operations can be shared.

Figure 21: Diagrams of a FP adder and a FP butterfly. (The adder chains alignment, negation, fraction addition, and leading-one detection/normalization to produce fa + fb; the butterfly shares alignment and negation, runs two fraction additions in parallel, splits normalization and leading-one detection into parallel paths, and uses selection logic to produce fa + fb and fa − fb.)
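A behavioral sketch of the shared butterfly (our Python model, not the paper's Verilog; operands are simplified to (exponent, signed fraction) pairs) makes the sharing explicit: prealignment is done once, and the sum and difference fractions are then computed in parallel.

    def fp_butterfly(fa, fb):
        """Behavioral model of the FP butterfly: computes fa + fb and
        fa - fb while sharing the prealignment that two separate FP
        adders would each repeat on the same inputs.  Operands are
        (exponent, signed integer fraction) pairs; normalization and
        leading-one detection of the two results are left out, since
        they run in separate parallel paths (Figure 21b)."""
        (ea, ma), (eb, mb) = fa, fb
        swapped = ea < eb
        if swapped:                    # align against the larger exponent
            (ea, ma), (eb, mb) = (eb, mb), (ea, ma)
        mb_aligned = mb >> (ea - eb)   # shared prealignment, done once
        s = ma + mb_aligned            # fraction of fa + fb
        d = ma - mb_aligned            # fraction of (fa - fb), up to sign
        if swapped:
            d = -d                     # undo the operand swap for the difference
        return (ea, s), (ea, d)

    print(fp_butterfly((5, 96), (3, 64)))   # -> ((5, 112), (5, 80))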

6.2. Results comparison

We implemented the IDCT in 32-bit IEEE FP, 15-bit radix-16 lightweight FP, and fixed-point arithmetic (see the comparison in Table 11). In the fixed-point implementation, we preserve 12-bit accuracy for the constants, and the widest bit-width in the whole algorithm is 24 bits (not fine-tuned). From the perspective of power, the lightweight FP IDCT consumes only around 1/10 of the power of the IEEE FP IDCT, and is comparable with the fixed-point implementation.

7. CONCLUSION

In this paper, we introduce C++ and Verilog libraries of lightweight FP arithmetic, focusing on the most critical arithmetic operators (addition, multiplication) and the most common parameterizations useful for multimedia tasks (bit-width, rounding modes, exception handling, radix). With these libraries, we can easily translate a standard FP program into a lightweight FP program, and explore the trade-off between system numerical performance and hardware complexity.

An H.263 video codec is chosen as our benchmark. Such media applications do not need a wide dynamic range or high precision in their computations, so lightweight FP can be applied efficiently. By examining the histogram information of the FP numbers and the relationship between PSNR and bit-width, we demonstrate that our video codec suffers almost no quality degradation even when more than half of the bit-width of standard FP is removed. Other features specified in standard FP, such as rounding modes, exception handling, and the choice of radix, are also discussed for this particular application.


Table 10: Comparison of two FP adders and a "butterfly."

Structure        Area (µm²)    Delay (ns)
Two FP adders    10412         7.49
Butterfly        6545          8.14

Table 11: Comparison of three implementations of IDCT.

Implementation    Area (µm²)    Delay (ns)    Power (mW)
IEEE FP           926810        111           1360
Lightweight FP    216236        46.75         143
Fixed-point       106598        36.11         110

Such optimization offers a huge reduction in hardware cost. In the hardware implementation of the IDCT, we combined two FP adders into a butterfly (the basic component of the IDCT), which further reduced the hardware cost. Finally, we show that the power consumption of the lightweight FP IDCT is only 10.5% of that of the standard FP IDCT (143 mW versus 1360 mW in Table 11), and comparable to that of the fixed-point IDCT.

ACKNOWLEDGMENT

This work was funded in part by the Pittsburgh Digital Greenhouse and the Semiconductor Research Corporation.

REFERENCES

[1] D. Dobberpuhl, "The design of a high performance low power microprocessor," in International Symposium on Low Power Electronics and Design, pp. 11–16, Monterey, Calif, USA, August 1996.

[2] S. Kim and W. Sung, "Fixed-point error analysis and word length optimization of 8 × 8 IDCT architectures," IEEE Trans. Circuits and Systems for Video Technology, vol. 8, no. 8, pp. 935–940, 1998.

[3] K. Kum, J. Kang, and W. Sung, "AUTOSCALER for C: An optimizing floating-point to integer C program converter for fixed-point digital signal processors," IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 47, no. 9, pp. 840–848, 2000.

[4] The Institute of Electrical and Electronics Engineers, Inc., "IEEE standard for binary floating-point arithmetic," ANSI/IEEE Std 754-1985, 1985.

[5] D. M. Samanj, J. Ellinger, E. J. Powers, and E. E. Swartzlander, "Simulation of variable precision IEEE floating point using C++ and its application in digital signal processor design," in Proceedings of the 36th Midwest Symposium on Circuits and Systems, pp. 1509–1514, 1993.

[6] R. Ignatowski and E. E. Swartzlander, "Creating new algorithms and modifying old algorithms to use the variable precision floating point simulator," in Conference Record of the 28th Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 152–156, 1994.

[7] X. Wan, Y. Wang, and W. H. Chen, "Dynamic range analysis for the implementation of fast transform," IEEE Trans. Circuits and Systems for Video Technology, vol. 5, no. 2, pp. 178–180, 1995.

[8] P. C. H. Meier, R. A. Rutenbar, and L. R. Carley, "Exploring multiplier architecture and layout for low power," in Proc. IEEE 1996 Custom Integrated Circuits Conference, pp. 513–516, San Diego, Calif, USA, May 1996.

[9] A. W. Burks, H. H. Goldstine, and J. von Neumann, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument, in Computer Structures: Readings and Examples, McGraw-Hill, 1971.

[10] The Institute of Electrical and Electronics Engineers, Inc., "A proposed standard for binary floating-point arithmetic," Draft 8.0 of IEEE Task P754, 1981.

[11] S. F. Anderson, J. G. Earle, R. E. Goldschmidt, and D. M. Powers, "The IBM System/360 Model 91: floating-point execution unit," IBM Journal of Research and Development, vol. 11, no. 1, pp. 34–53, 1967.

[12] Institute of Electrical and Electronics Engineers, Inc., "IEEE standard specifications for the implementations of 8 × 8 inverse discrete cosine transform," IEEE Std 1180-1990, 1990.

[13] E. Hokenek and R. K. Montoye, "Leading-zero anticipator (LZA) in the IBM RISC System/6000 floating-point execution unit," IBM Journal of Research and Development, vol. 34, no. 1, pp. 71–77, 1990.

[14] S. F. Oberman, H. Al-Twaijry, and M. J. Flynn, "The SNAP project: design of floating point arithmetic units," in Proc. 13th Symposium on Computer Arithmetic, pp. 156–165, Asilomar, Calif, USA, July 1997.

[15] K. P. Acken, M. J. Irwin, and R. M. Owens, "Power comparisons for barrel shifters," in International Symposium on Low Power Electronics and Design, pp. 209–212, Monterey, Calif, USA, August 1996.

[16] R. K. Montoye, E. Hokenek, and S. L. Runyon, "Design of the IBM RISC System/6000 floating-point execution unit," IBM Journal of Research and Development, vol. 34, no. 1, pp. 59–70, 1990.

[17] R. V. K. Pillai, D. Al-Khalili, and A. J. Al-Khalili, "A low power approach to floating point adder design," in Proc. IEEE International Conference on Computer Design: VLSI in Computers and Processors, pp. 178–185, Austin, Tex, USA, October 1997.

[18] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective, Prentice-Hall, New Jersey, USA, 1996.

[19] A. Artieri and O. Colavin, "A chip set core for image compression," IEEE Trans. on Consumer Electronics, vol. 36, pp. 395–402, August 1990.

Fang Fang received the B.S. degree in electrical engineering from Southeast University, China, in 1999 and the M.S. degree from Carnegie Mellon University, Pittsburgh, PA, in 2001. Currently, she is a Ph.D. candidate in the Center for Silicon System Implementation and the Advanced Multimedia Processing Lab in the Electrical and Computer Engineering Department at Carnegie Mellon University. Her research interests include the numerical performance of signal processing algorithms and high-level hardware synthesis of DSP algorithms. Her current work focuses on an automatic design flow for the lightweight floating-point system, and hardware synthesis of WHT and FFT transforms.

Page 22: Implementation of DSP and Communication Systemsdownloads.hindawi.com/journals/specialissues/434718.pdf · ing society (1996–1998), a board member at IEEE neural network council,

892 EURASIP Journal on Applied Signal Processing

Tsuhan Chen received the B.S. degree in electrical engineering from the National Taiwan University in 1987, and the M.S. and Ph.D. degrees in electrical engineering from the California Institute of Technology, Pasadena, California, in 1990 and 1993, respectively. Since October 1997, Tsuhan Chen has been with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania, where he is now a Professor. He directs the Advanced Multimedia Processing Laboratory. His research interests include multimedia signal processing and communication, audiovisual interaction, biometrics, processing of 2D/3D graphics, bioinformatics, and building collaborative virtual environments. From August 1993 to October 1997, he worked in the Visual Communications Research Department, AT&T Bell Laboratories, Holmdel, New Jersey, and later at AT&T Labs-Research, Red Bank, New Jersey. Tsuhan helped create the Technical Committee on Multimedia Signal Processing, as the founding chair, and the Multimedia Signal Processing Workshop, both in the IEEE Signal Processing Society. He has recently been appointed as the Editor-in-Chief of the IEEE Transactions on Multimedia for the period 2002–2004. He has coedited a book titled "Advances in Multimedia: Systems, Standards, and Networks." He is a recipient of the National Science Foundation CAREER Award.

Rob A. Rutenbar received the Ph.D. degree from the University of Michigan in 1984, and subsequently joined the faculty of Carnegie Mellon University. He is currently the Stephen J. Jatras Professor of Electrical and Computer Engineering, and (by courtesy) of Computer Science. He is the founding Director of the MARCO/DARPA Center for Circuits, Systems, Software (C2S2), a consortium of U.S. universities chartered in 2001 to explore long-term solutions for next-generation circuit challenges. His research interests focus on circuit and layout synthesis algorithms for mixed-signal ASICs and for high-speed digital systems. In 1987, Dr. Rutenbar received a Presidential Young Investigator Award from the National Science Foundation. He was General Chair of the 1996 International Conference on CAD. From 1992 through 1996, he chaired the Analog Technical Advisory Board for Cadence Design Systems. In 2001 he was co-winner of the Semiconductor Research Corporation's Aristotle Award for contributions to graduate education. He is a Fellow of the IEEE and a member of the ACM and Eta Kappa Nu.


EURASIP Journal on Applied Signal Processing 2002:9, 893–907
© 2002 Hindawi Publishing Corporation

High-Level Synthesis of DSP Applications Using Adaptive Negative Cycle Detection

Nitin Chandrachoodan
Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA
Email: [email protected]

Shuvra S. Bhattacharyya
Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA
Email: [email protected]

K. J. Ray Liu
Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA
Email: [email protected]

Received 31 August 2001 and in revised form 15 May 2002

The problem of detecting negative weight cycles in a graph is examined in the context of the dynamic graph structures that arise in the process of high-level synthesis (HLS). The concept of adaptive negative cycle detection is introduced, in which a graph changes over time and negative cycle detection needs to be done periodically, but not necessarily after every individual change. We present an algorithm for this problem, based on a novel extension of the well-known Bellman-Ford algorithm that allows us to adapt existing cycle information to the modified graph, and show by experiments that our algorithm significantly outperforms previous incremental approaches for dynamic graphs. In terms of applications, the adaptive technique leads to a very fast implementation of Lawler's algorithm for the computation of the maximum cycle mean (MCM) of a graph, especially for a certain form of sparse graph. Such sparseness often occurs in practical circuits and systems, as demonstrated, for example, by the ISCAS 89/93 benchmarks. The application of the adaptive technique to design-space exploration (synthesis) is also demonstrated by developing automated search techniques for scheduling iterative data-flow graphs.

Keywords and phrases: negative cycle detection, dynamic graphs, maximum cycle mean, adaptive performance estimation.

1. INTRODUCTION

High-level synthesis of circuits for digital signal processing (DSP) applications is an area of considerable interest due to the rapid increase in the number of devices requiring multimedia and DSP algorithms. High-level synthesis (HLS) plays an important role in the overall system synthesis process because it speeds up the process of converting an algorithm into an implementation in hardware, software, or a mixture of both. In HLS, the algorithm is represented in an abstract form (usually a dataflow graph), and this representation is transformed and mapped onto architectural elements from a library of resources. These resources could be pure hardware elements like adders or logic gates, or they could be general purpose processors with the appropriate software to execute the required operations. The architecture could also involve a combination of both of the above, in which case the problem becomes one of hardware-software cosynthesis.

HLS involves several stages and requires computation of several parameters. In particular, performance estimation is one very important part of the HLS process, and the actual design space exploration (synthesis of architecture) is another. Performance estimation involves using timing information about the library elements to obtain an estimate of the throughput that can be obtained from the synthesized implementation. A particularly important estimate for iterative dataflow graphs is known as the maximum cycle mean (MCM) [1, 2]. This quantity provides a bound on the maximum throughput attainable by the system. A fast method for computing the MCM would therefore enable this metric to be computed easily for a large number of system configurations.

The other important problem in HLS is the problem of design space exploration. This requires the selection of an appropriate set of elements from the resource library and the mapping of the dataflow graph functions onto these resources. This problem is known to be NP-complete, and there exist several heuristic approaches that attempt to provide sufficiently good solutions. An important feature of HLS of DSP applications is that the mapping to hardware needs to be done only once for a given design, which is then produced in large quantities for a consumer market. As a result, for these applications (such as modems, wireless phones, multimedia terminals, etc.), it makes sense to consider the possibility of investing large amounts of computational power at compile time, so that a more highly optimized result can be used at run time.

The problems in HLS described above both have the common feature of requiring a fast solution to the problem of detecting negative cycles in a graph. This is because the execution times of the various resources combine with the graph structure to impose a set of constraints on the system, and checking the feasibility of this set of constraints is equivalent to checking for the presence of negative cycles in the corresponding constraint graph.

DSP applications, more than other embedded applications considered in HLS, have the property that they are cyclic in nature. As explained in Section 2, this means that the problem of negative cycle detection in constraint analysis is more relevant to such systems. In order to make use of the increased computational power that is available, one possibility is to conduct more extensive searches of the design space than are performed by a single heuristic. One possible approach to this problem involves an iterative improvement system based on generating modified versions of an existing implementation and verifying their correctness. The incremental improvements can then be used to guide a search of the design space that can be tailored to fit within the maximum time allotted to the exploration problem. In this process, the most computationally intensive part is verifying the correctness of the modified systems, and therefore speeding up this process has a direct impact on the size of the explored region of the design space.

In addition to these problems from HLS, several other problems in circuits and systems theory require the solving of constraint equations [2, 3, 4, 5, 6]. Examples include very large scale integrated circuit (VLSI) layout compaction, interactive (reactive) systems, graphic layout heuristics, and timing analysis and retiming of circuits for performance or area considerations. Though a general system of constraints would require a linear programming (LP) approach to solve it, several problems of interest actually consist of the special case of difference constraints (each constraint expresses the minimum or maximum value that the difference of two variables in the system can take). These problems can be attacked by faster techniques than general LP, mostly involving the solution of a shortest path problem on a weighted directed graph. Detection of negative cycles in the graph is therefore a closely related problem, as it would indicate the infeasibility of the constraint system.

Because of the above reasons, detecting the presence of negative cycles in a weighted directed graph is a very important problem in systems theory. This problem is also important in the computation of network flows. Considerable effort has been spent on finding efficient algorithms for this purpose. Cherkassky and Goldberg [3] have performed a comprehensive survey of existing techniques. Their study shows some interesting features of the available algorithms, such as the fact that for a large class of random graphs, the worst-case performance bound is far more pessimistic than the observed performance.

There are also situations in which it is useful or necessary to maintain a feasible solution to a set of difference constraints as a system evolves. Typical examples are real-time or interactive systems, where constraints are added or removed one (or several) at a time, and after each such modification it is required to determine whether the resulting system has a feasible solution and, if so, to find it. In these situations, it is often more efficient to adapt existing information to aid the solution of the constraint system. In the example from HLS mentioned previously, it is possible to cast the problem of design space exploration in a way that benefits from this approach.

Several researchers [5, 7, 8] have worked on the area of incremental computation. They have presented analyses of algorithms for the shortest path problem and negative cycle detection in dynamic graphs. Most of the approaches try to apply modifications of Dijkstra's algorithm to the problem. The obvious reason for this is that it is the fastest known algorithm for the problem when only positive weights are allowed on edges. However, the use of Dijkstra's algorithm as the basis for incremental computation requires the changes to be handled one at a time. While this may often be efficient enough, there are many cases where the ability to handle multiple changes simultaneously would be more advantageous. For example, it is possible that in a sequence of changes, one reverses the effect of another: in this case, a normal incremental approach would perform the same computation twice, while a delayed adaptive computation would not waste any effort.

In this paper, we present an approach that generalizes the adaptive approach beyond single increments: we address multiple changes to the graph simultaneously. Our approach can be applied to cases where it is possible to collect several changes to the graph structure before updating the solution to the constraint set. As mentioned previously, this can result in increased efficiency in several important problems. We present simulation results comparing our method against the single-increment algorithm proposed in [4]. For larger numbers of changes, our algorithm performs considerably better than this incremental algorithm.

To illustrate the advantages of our adaptive approach, we present two applications from the area of HLS that require the solution of difference constraint problems and therefore benefit from the application of our technique. For the problem of performance estimation, we show how the new technique can be used to derive a fast implementation of Lawler's algorithm [9] for computing the MCM of a weighted directed graph. We present experimental results comparing this against Howard's algorithm [2, 10], which appears to be the fastest algorithm available in practice. We find that for graph sizes and node degrees similar to those of real circuits, our algorithm often outperforms even Howard's algorithm.

For the problem of design space exploration, we present a search technique for finding schedules for iterative dataflow graphs that uses the adaptive negative cycle detection algorithm as a subroutine. We illustrate the use of this local search technique by applying it to the problem of resource-constrained scheduling for minimum power in the presence of functional units that can operate at multiple voltages. The method we develop is quite general and can therefore easily be extended to the case where we are interested in optimizing criteria other than power consumption.

This paper is organized as follows. Section 2 surveys previous work on shortest path algorithms and incremental algorithms. In Section 3, we describe the adaptive algorithm that works efficiently on multiple changes to a graph. Section 4 compares our algorithm against the existing approach, as well as against another possible candidate for adaptive operation. Section 5 then gives details of the applications mentioned above and presents some experimental results. Finally, we present our conclusions and examine areas that would be suitable for further investigation.

A preliminary version of the results presented in this paper was published in [11].

2. BACKGROUND AND PROBLEM FORMULATION

Cherkassky and Goldberg [3] have conducted an extensive survey of algorithms for detecting negative cycles in graphs. They have also performed a similar study on the problem of shortest path computations. They present several problem families that can be used to test the effectiveness of a cycle-detection algorithm. One surprising fact is that the best known theoretical bound (O(|V||E|), where |V| is the number of vertices and |E| is the number of edges in the graph) for solving the shortest path problem (with arbitrary weights) is also the best known time bound for the negative cycle problem. But examining the experimental results from their work reveals the interesting fact that in almost all of the studied samples, the performance is considerably less costly than would be suggested by the product |V| × |E|. It appears that the worst case is rarely encountered in random examples, and an average case analysis of the algorithms might be more useful.

Recently, there has been increased interest in the subject of dynamic or incremental algorithms for solving problems [5, 7, 8]. This work uses the fact that in several problems where a graph algorithm such as shortest paths or transitive closure needs to be solved, it is often the case that we need to repeatedly solve the problem on variants of the original graph. The algorithms therefore store information about the problem that was obtained during a previous iteration and use this as an efficient starting point for the new problem instance corresponding to the slightly altered graph. The concept of bounded incremental computation introduced in [7] provides a framework within which the improvement afforded by this approach can be quantified and analyzed.

In this paper, the problem we are most interested in is that of maintaining a solution to a set of difference constraints. This is equivalent to maintaining a shortest path tree in a dynamic graph [4]. Frigioni et al. [5] present an algorithm for maintaining shortest paths in arbitrary graphs that performs better than starting from scratch, while Ramalingam and Reps [12] present a generalization of the shortest path problem and show how it can be used to handle the case where there are few negative-weight edges. In both of these cases, they have considered one change at a time (not multiple changes), and the emphasis has been on the theoretical time bound rather than on experimental analysis. In [13], the authors present an experimental study, but only for the case of positive-weight edges, which restricts the study to the computation of shortest paths and does not consider negative-weight cycles.

The most significant work along the lines we propose is described in [4]. In this, the authors use the observation that in order to detect negative cycles, it is not necessary to maintain a tree of the shortest paths to each vertex. They suggest an improved algorithm based on Dijkstra's algorithm, which is able to recompute a feasible solution (or detect a negative cycle) in time O(|E| + |V| log |V|), or in terms of output complexity (defined and motivated in [4]) O(‖∆‖ + |∆| log |∆|), where |∆| is the number of variables whose values are changed and ‖∆‖ is the number of constraints involving the variables whose values have changed.

The above problem can be generalized to allow multiple changes to the graph between calls to the negative cycle detection algorithm. In this case, the above algorithms would require the changes to be handled one at a time, and would therefore take time proportional to the total number of changes. On the other hand, it would be preferable to obtain a solution whose complexity depends on the number of updates requested, rather than on the total number of changes applied to the graph. Multiple changes between updates to the negative cycle computation arise naturally in many interactive environments (e.g., if we prefer to accumulate changes between refreshes of the state, using the idea of lazy evaluation) or in design space exploration, as can be seen, for example, in Section 5.2. By accumulating changes and processing them in large batches, we remove a large overhead from the computation, which may result in considerably faster algorithms.

Note that the work in [4] also considers the addition/deletion of constraints only one at a time. It needs to be emphasized that this limitation is basic to the design of the algorithm: Dijkstra's algorithm can be applied only when the changes are considered one at a time. This is acceptable in many contexts, since Dijkstra's algorithm is the fastest algorithm for the case where edge weights are positive. If we tried using another shortest paths algorithm, we would incur a performance penalty. However, as we show, this loss in performance in the case of unit changes may be offset by improved performance when we consider multiple changes.

The approach we present for the solution is to extend the classical Bellman-Ford algorithm for shortest paths in such a way that the solution obtained in one problem instance can be used to reduce the complexity of the solution in modified versions of the graph. In the incremental case (a single change to the graph), this problem is related to the problem of analyzing the sensitivity of the algorithm [14]. Sensitivity analysis studies the performance of an algorithm when its inputs are slightly perturbed. Note that there do not appear to be any average-case sensitivity analyses of the Bellman-Ford algorithm, and the approach presented in [14] has a quadratic running time in the size of the graph. This analysis is performed for a general graph without regard to any special properties it may have. But as explained in Section 5.1.1, graphs corresponding to circuits and systems in HLS for DSP are typically very sparse: most benchmark graphs tend to have a ratio of about 2 edges per vertex, and the number of delay elements is also small relative to the total number of vertices. Our experiments have shown that in these cases, the adaptive approach is able to do much better than a quadratic approach. We also provide application examples to show other potential uses of the approach.

In the following sections, we show that our approach performs almost as well as the approach in [4] (experimentally) for changes made one at a time, and significantly outperforms their approach under the general case of multiple changes (this is true even for relatively small batches of changes, as will be seen from the results). Also, when the number of changes between updates is very large, our algorithm reduces to the normal Bellman-Ford algorithm (starting from scratch), so we do not lose in performance. This is important since, when a large number of changes are made, the problem can be viewed as one of solving the shortest-path problem for a new graph instance, and we should not perform worse than the standard available technique for that.

Our interest in adaptive negative cycle detection stems primarily from its application in the problems of HLS that we outlined in the introduction. To demonstrate its usefulness in these areas, we have used this technique to obtain improved implementations of the performance estimation problem (computation of the MCM) and to implement an iterative improvement technique for design space exploration. Dasdan et al. [2] present an extensive study of existing algorithms for computing the MCM. They conclude that the most efficient algorithm in practice is Howard's algorithm [10]. We show that the well-known Lawler's algorithm [9], when implemented using an efficient negative cycle detection technique and with the added benefit of our adaptive negative cycle detection approach, actually outperforms this algorithm for several test cases, including several of the ISCAS benchmarks, which represent reasonably sized circuits.
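To see why a fast negative cycle detector speeds up Lawler's algorithm, recall its structure: a binary search on the candidate cycle mean λ, where each probe asks whether the reweighted graph has a negative cycle. The Python sketch below is our schematic rendering, not the paper's implementation: it uses a plain Bellman-Ford test where the paper substitutes the ABF routine, and the search bounds lo/hi and the tolerance are assumed to bracket the answer.

    def has_negative_cycle(n, edges):
        """Plain Bellman-Ford negative cycle test on vertices 0..n-1.
        Zero initial labels model the augmented source s0; labels that
        are still changing after n passes expose a negative cycle."""
        dist = [0.0] * n
        for _ in range(n):
            changed = False
            for (u, v, w) in edges:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    changed = True
            if not changed:
                return False
        return True

    def max_cycle_mean(n, edges, lo=0.0, hi=100.0, tol=1e-4):
        """Lawler's algorithm, schematically: binary search on the cycle
        mean lambda.  MCM <= lambda exactly when the graph reweighted to
        lambda*d(e) - w(e) has no negative cycle, so every probe is one
        negative-cycle query (the ABF routine in the paper).  'edges'
        holds (u, v, w, d) tuples: weight w(e) and delay count d(e)."""
        while hi - lo > tol:
            lam = 0.5 * (lo + hi)
            reweighted = [(u, v, lam * d - w) for (u, v, w, d) in edges]
            if has_negative_cycle(n, reweighted):
                lo = lam          # some cycle still has mean > lam
            else:
                hi = lam
        return 0.5 * (lo + hi)

    # Two-vertex loop with total weight 3 and one delay: MCM = 3.
    print(max_cycle_mean(2, [(0, 1, 1.0, 0), (1, 0, 2.0, 1)]))  # ~3.0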

As mentioned previously, the relevance of negative cycle detection to design space exploration is due to the cyclic nature of the graphs for DSP applications. That is, there is often a dependence between the computation in one iteration and the values computed in previous iterations. Such graphs are referred to as iterative dataflow graphs [15]. Traditional scheduling techniques tend to consider only the latency of the system, converting it to an acyclic graph if necessary. This can result in a loss of the ability to exploit inter-iteration parallelism effectively. Methods such as optimum unfolding [16] and range-chart guided scheduling [15] try to avoid this loss in potential parallelism by working directly on the cyclic graph. However, they suffer from some disadvantages of their own. Optimum unfolding can potentially lead to a large increase in the size of the resulting graph to be scheduled. Range-chart guided scheduling is a deterministic heuristic that could miss potential solutions. In addition, the process of scanning through all possible time intervals for scheduling an operation works only when the run times of operations are small integers; this is more suited to a software implementation than to a general hardware design. These techniques also work only after a function-to-resource binding is known, as they require timing information for the functions in order to schedule them. For the general architecture synthesis problem, this binding itself needs to be found through a search procedure, so it is reasonable to consider alternate search schemes that combine the search for an architecture with the search for a schedule.

If the cyclic dataflow graph is used to construct a constraint graph, then the feasibility of the resulting system is determined by the absence of negative cycles in the graph. This can be used to obtain exact schedules capable of attaining the performance bound for a given function-to-resource binding. For the problem of design space exploration, we treat the problem of scheduling an iterative dataflow graph (IDFG) as a problem of searching for an efficient ordering of function vertices on processors, which can be treated as the addition of several timing constraints to an existing set of constraints. We implement a simple search technique that uses this approach to solve a number of scheduling problems, including scheduling for low power on multiple-voltage resources and scheduling on homogeneous processors, within a single framework. Since the feasibility analysis forms the core of the search, speeding it up results in a proportionate increase in the number of designs evaluated (until it is no longer the bottleneck in the overall computation). The adaptive negative cycle detection technique ensures that we can do such searches efficiently, by restricting the computations required.
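A sketch of how an ordering decision becomes constraint edges (our simplification of the idea; the names and conventions are hypothetical): placing operation b after a on the same processor requires start(b) − start(a) ≥ time(a), i.e., start(a) − start(b) ≤ −time(a), which is one more difference-constraint edge. The candidate schedule is kept only if the updated graph still has no negative cycle.

    def ordering_edges(processor_order, exec_time):
        """Turn a per-processor operation ordering into difference
        constraints start(a) - start(b) <= -time(a), written in the
        <=-form used by the constraint graph (edge from a's vertex to
        b's vertex with weight -time(a))."""
        edges = []
        for ops in processor_order:               # one ordered list per processor
            for a, b in zip(ops, ops[1:]):        # consecutive ops share a resource
                edges.append((a, b, -exec_time[a]))
        return edges

    # Two operations sharing one processor; the resulting edge is then
    # merged with the dataflow constraints and checked for negative cycles.
    print(ordering_edges([["mul1", "add1"]], {"mul1": 2, "add1": 1}))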

3. THE ADAPTIVE BELLMAN-FORD ALGORITHM

We present the basis of the adaptive approach that enables efficient detection of negative cycles in dynamic graphs.

We first note that the problem of detecting negative cycles in a weighted directed graph (digraph) is equivalent to finding whether or not a set of difference inequality constraints has a feasible solution. To see this, observe that if we have a set of difference constraints of the form

x_i − x_j ≤ b_ij, (1)

we can construct a digraph with vertices corresponding to the x_i, and an edge e_ij directed from the vertex corresponding to x_i to the vertex for x_j such that weight(e_ij) = b_ij. This procedure is performed for each constraint in the system, and a weighted directed graph is obtained. Solving for shortest paths in this graph yields a set of distances dist that satisfy the constraints on the x_i. This graph is henceforth referred to as the constraint graph.
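A small illustration of the construction (our example): three difference constraints become three weighted edges, and feasibility is decided entirely by the cycle weights.

    # Our illustration: the constraints
    #   x1 - x2 <= 3,   x2 - x3 <= -2,   x3 - x1 <= -1
    # each yield one edge of weight b_ij from the vertex of x_i to the
    # vertex of x_j, following equation (1).
    constraints = [("x1", "x2", 3), ("x2", "x3", -2), ("x3", "x1", -1)]
    edges = [(src, dst, weight) for (src, dst, weight) in constraints]

    # The only cycle has weight 3 + (-2) + (-1) = 0, which is not
    # negative, so the system is feasible; for instance
    # x1 = 0, x2 = -3, x3 = -1 satisfies all three constraints.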


The usual technique used to solve for dist is to introduce an imaginary vertex s0 to act as a source, and to introduce edges of zero weight from this vertex to each of the other vertices. The resulting graph is referred to as the augmented graph [4]. In this way, we can use a single-source shortest paths algorithm to find dist from s0, and any negative cycle (infeasible solution) found in the augmented graph must also be present in the original graph, since the new vertex and edges cannot create cycles.

The basic Bellman-Ford algorithm does not provide a standard way of detecting negative cycles in the graph. However, it is obvious from the way the algorithm operates that if changes in the distance labels continue to occur for more than a certain number of iterations, there must be a negative cycle in the graph. This observation has been used to detect negative cycles, and with this straightforward implementation, we obtain an algorithm to detect negative cycles that takes O(|V|³) time, where |V| is the number of vertices in the graph.
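A compact version of this straightforward detector (our sketch; the paper's implementation instead uses Tarjan's subtree disassembly, described next): initializing every label to zero plays the role of the augmenting source s0, and labels that are still changing after |V| passes expose a negative cycle.

    from typing import Optional

    def solve_difference_constraints(vertices, edges) -> Optional[dict]:
        """Feasibility check via the augmented graph and Bellman-Ford.
        'edges' holds (u, v, w) triples; zero initial labels model the
        zero-weight augmenting edges from s0.  If labels still change
        after |V| relaxation passes, a negative cycle exists and the
        system is infeasible (returns None); otherwise the labels are
        themselves a feasible solution."""
        dist = {v: 0 for v in vertices}          # implicit augmenting edges
        for _ in range(len(vertices)):
            changed = False
            for (u, v, w) in edges:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    changed = True
            if not changed:
                return dist                      # converged: feasible
        return None                              # negative cycle detected

    print(solve_difference_constraints(
        {"a", "b", "c"}, [("a", "b", 3), ("b", "c", -2), ("c", "a", -1)]))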

The study by Cherkassky and Goldberg [3] presents several variants of the negative cycle detection technique. The technique they found to be most efficient in practice is based on the subtree disassembly technique proposed by Tarjan [17]. This algorithm works by constructing a shortest path tree as it proceeds from the source of the problem, and any negative cycle in the graph will first manifest itself as a violation of the tree order in the construction. The experimental evaluation presented in their study found this algorithm to be a robust variant for the negative cycle detection problem. As a result of their findings, we have chosen this algorithm as the basis for the adaptive algorithm. Our modified algorithm is henceforth referred to as the "adaptive Bellman-Ford (ABF)" algorithm.

The adaptive version of the Bellman-Ford algorithm works on the basis of storing the distance labels that were computed from the source vertex from one iteration to the next. Since the negative cycle detection problem requires that the source vertex is always the same (the augmenting vertex), it is intuitive that as long as most edge weights do not change, the distance labels for most of the vertices will also remain the same. Therefore, by storing this information and using it as a starting point for the negative cycle detection routines, we can save a considerable amount of computation.
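The idea can be sketched in a few lines (our simplification; the paper's implementation uses Tarjan's subtree disassembly rather than the crude scan-count cutoff used here): keep the old labels, seed a worklist only with the tails of constraints that the old labels now violate, and relax from there.

    from collections import deque

    def abf_recheck(vertices, edges, dist, changed_edges):
        """Sketch of the adaptive warm start: reuse the distance labels
        'dist' computed before the graph changed (equivalently, reweight
        the augmenting edges with the old labels) and relax only from
        constraints that the old labels now violate.  'edges' already
        carries the new weights and includes 'changed_edges'.  Returns
        updated labels, or None on a negative cycle (infeasible)."""
        out = {v: [] for v in vertices}
        for (u, v, w) in edges:
            out[u].append((v, w))
        # Only tails of violated edges need to be reprocessed.
        queue = deque(u for (u, v, w) in changed_edges if dist[u] + w < dist[v])
        scans = 0
        while queue:
            u = queue.popleft()
            for (v, w) in out[u]:
                scans += 1
                if scans > len(vertices) * len(edges):
                    return None          # still relaxing: negative cycle
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    queue.append(v)
        return dist

With batched updates, changed_edges is simply the accumulated set of modified edges, so one call handles an entire batch.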

One possible objection to this scheme is that we would need to scan all the edges each time in order to detect the vertices that have been affected. But in most applications involving multiple changes to a graph, it is possible to pass information to the algorithm about which vertices have been affected. This information can be generated by the higher-level application-specific process making the modifications. For example, if we consider multiprocessor scheduling, the high-level process would generate a new vertex ordering and add edges to the graph to represent the new constraints. Since any changes to the graph can only occur at these edges, the application can pass on to the ABF algorithm precise information about what changes have been made to the graph, thus saving the trouble of scanning the graph for changes.

Note that in the event where the high-level application cannot pass on this information without adding significant bookkeeping overhead, the additional work required for a scan of the edges is proportional to the number of edges, and hence does not affect the overall complexity, which is at least as large as this. For example, in the case of the maximum cycle mean computation examined in Section 5.1, for most circuit graphs the number of edges with delays is about 1/10 as many as the total number of edges. With each change in the target iteration period, most of these edges will cause constraint violations. In such a situation, an edge scan provides a way of detecting violations that is very fast and easy to implement, while not increasing the overall complexity of the method.

3.1. Correctness of the method

The use of a shortest path routine to find a solution to a system of difference constraint equations is based on the following two theorems, which are not hard to prove (see [18]).

Theorem 1. A system of difference constraints is consistent if and only if its augmented constraint graph has no negative cycles, and the latter condition holds if and only if the original constraint graph has no negative cycles.

Theorem 2. Let G be the augmented constraint graph of a consistent system of constraints 〈V, C〉. Then D is a feasible solution for 〈V, C〉, where

D(u) = dist_G(s0, u). (2)

The augmented constraint graph consists of this graph, together with an additional source vertex (s0) that has zero-weight edges leading to all the other existing vertices, and consistency means that a set of x_i exists that satisfies all the constraints in the system.

In the adaptive version of the algorithm, we are effectively setting the weights of the augmenting edges equal to the labels that were computed in the previous iteration. In this way, the initial scan from the augmenting vertex sets the distance label at each vertex equal to the previously computed weight instead of setting it to 0. So we now need to show that using nonzero weights on the augmenting edges does not change the solution space in any way: that is, all possible solutions for the zero-weight problem are also solutions for the nonzero-weight problem, except possibly for translation by a constant.

The new algorithm with the adaptation enhancements can be seen to be correct if we relax the definition of the augmented graph so that the augmenting edges (from s0) need not have zero weight. We summarize the arguments for this in the following theorems.

Theorem 3. Consider a constraint graph augmented with a source vertex s0 and edges from this vertex to every other vertex v, such that these augmenting edges have arbitrary weights weight(s0 → v). The associated system of constraints is consistent if and only if the augmented graph defined above has no negative cycles, which in turn holds if and only if the original constraint graph has no negative cycles.


Proof. Clearly, since s0 does not have any in-edges, no cycles can pass through it. So any cycles, negative or otherwise, that are detected in the augmented graph must have come from the original constraint graph, which in turn would happen only if the constraint system were inconsistent (by Theorem 1). Also, any inconsistency in the original system would manifest itself as a negative cycle in the constraint graph, and the above augmentation cannot remove any such cycle.

The following theorem establishes the validity of the solutions computed by the ABF algorithm.

Theorem 4. If G′ is the augmented graph with arbitrary weights as defined above, and D(u) = dist_G′(s0, u) (shortest paths from s0), then

(1) D is a solution to 〈V, C〉; and
(2) any solution to 〈V, C〉 can be converted into a solution to the constraint system represented by G′ by adding a constant to D(u) for each u ∈ V.

Proof. The first part is obvious, by the definition of shortest paths.

Now we need to show that by augmenting the graph with arbitrary-weight edges, we do not prevent certain solutions from being found. To see this, first note that any solution to a difference constraint system remains a solution when translated by a constant. That is, we can add or subtract a constant to all the D(u) without changing the validity of the solution.

In our case, if we have a solution to the constraint system that does not satisfy the constraints posed by our augmented graph, it is clear that the constraint violation can only be on one of the augmenting edges (since the underlying constraint graph is the same as in the case where the augmenting edges had zero weight). Therefore, if we define

l_max = max{weight(e) | e ∈ S_a}, (3)

where S_a is the set of augmenting edges, and

D′(u) = D(u) − l_max, (4)

we ensure that D′ satisfies all the constraints of the original graph, as well as all the constraints on the augmenting edges.

Theorem 4 tells us that an augmented constraint graph with arbitrary weights on the augmenting edges can also be used to find a feasible solution to a constraint system. This means that once we have found a solution dist : V → R (where R is the set of real numbers) to the constraint system, we can change the augmented graph so that the weight on each augmenting edge e : s0 → v is dist(v). Now even if we change the underlying constraint graph in any way, we can use the same augmented graph to test the consistency of the new system.

Figure 1 helps to illustrate the concepts explained in the previous paragraphs. In Figure 1a, there is a change in the weight of one edge. But as we can see from the augmented graph, this will result in only a single update, to the affected vertex itself, and all the other vertices will get their constraint-satisfying values directly from the previous iteration.

Figure 1: Constraint graph. (A five-vertex example A–E shown with (a) zero-weight augmenting edges and (b) nonzero-weight augmenting edges from the augmenting vertex.)

Note that in general, several vertices could be affected by the change in weight of a single edge. For example, in Figure 1, if edge AC had not existed, then changing the weight of AB would have resulted in new distance labels for vertices C and D as well. These would be cascading effects from the change in the distance label for vertex B. Therefore, when we speak of affected vertices, we mean not just the vertices incident on an edge whose weight has changed, but possibly also vertices not directly on any edge that has undergone a change in constraint weight. The actual number of vertices affected by a single edge-weight change cannot be determined just by examining the graph; we would actually need to run the Bellman-Ford algorithm to find the complete set of affected vertices.


In the example from Figure 1, the change in the weight of edge AB means that after an initial scan to determine changes in distance labels, we find that vertex B is affected. However, on examining the outgoing edges from vertex B, we find that all other constraints are satisfied, so the Bellman-Ford algorithm can terminate here without proceeding to examine all other edges. Therefore, in this case, only 1 vertex out of the 5 vertices in the graph has its label affected. Furthermore, the experiments show that even in large sparse graphs, the effect of any single change is usually localized to a small region of the graph, and this is the main reason that the adaptive approach is useful, as opposed to other techniques that are developed for more general graphs. Note that, as explained in Section 2, the initial overhead for detecting constraint violations still holds, but the complexity of this operation is significantly less than that of the Bellman-Ford algorithm.

4. COMPARISON AGAINST OTHER INCREMENTAL ALGORITHMS

We compare the ABF algorithm against (a) the incremental algorithm developed in [4] for maintaining a solution to a set of difference constraints (referred to here as the RSJM algorithm), and (b) a modification of Howard's algorithm [10], since it appears to be the fastest known algorithm to compute the cycle mean, and hence can also be used to check the feasibility of a system. Our modification allows us to use some of the properties of adaptation to reduce the computation in this algorithm.

The main idea of the adaptive algorithm is that it is used as a routine inside a loop corresponding to a larger program. As a result, in several applications where negative cycle detection forms a computation bottleneck, there will be a proportional speedup in the overall application, which would be much larger than the speedup in a single run.

It is worth making a couple of observations at this point regarding the algorithms we compare against.

(1) The RSJM algorithm [4] uses Dijkstra's algorithm as the core routine for quickly recomputing the shortest paths. Using the Bellman-Ford algorithm here (even with Tarjan's implementation) would result in a loss in performance, since it cannot match the performance of Dijkstra's algorithm when edge weights are positive. Consequently, no benefit would be derived from the reduced-cost concept used in [4].

(2) The code for Howard's algorithm was obtained from the Internet website of the authors of [10]. The modifications suggested by Dasdan et al. [19] have been taken into account. This method of constraint checking uses Howard's algorithm to see if the MCM of the system yields a feasible value; otherwise the system is deemed inconsistent.

Another important point is the type of graphs on which we have tested the algorithms. We have restricted our attention to sparse graphs, or bounded-degree graphs. In particular, we have tried to keep the vertex-to-edge ratio similar to what we may find in practice, as in, for example, the ISCAS benchmarks. To understand why such graphs are relevant, note the following two points about the structural elements usually found in circuits and signal processing blocks: (a) they typically have a small, finite number of inputs and outputs (e.g., AND gates, adders, etc. are binary elements), and (b) the fanout that is allowed in these systems is usually limited for reasons of signal strength preservation (buffers are used if necessary). For these reasons, the graphs representing practical circuits can be well approximated by bounded-degree graphs. In more general DSP application graphs, constraints such as fanout may be ignored, but the modular nature of these systems (they are built up of simpler, small modules) implies that they normally have small vertex degrees.

We have implemented all the algorithms under the LEDA [20] framework for uniformity. The tests were run on random graphs, with several random variations performed on them thereafter. We kept the number of vertices constant and changed only the edges. This was done for the following reason: a change to a node (addition/deletion) may result in several edges being affected, and in general, due to the random nature of the graph, we cannot know in advance the exact number of altered edges. Therefore, in order to keep track of the exact number of changes, we applied changes only to the edges. Note that when node changes are allowed, the argument for an adaptive algorithm capable of handling multiple changes naturally becomes stronger.

In the discussion that follows, we use the term batch size to refer to the number of changes in a multiple-change update. That is, when we make multiple changes to a graph between updates, the changes are treated as a single batch, and the actual number of changes that was made is referred to as the batch size. This is a useful parameter for understanding the performance of the algorithms.

The changes that were applied to the graph were of 3 types:

(i) Edge insertion: an edge is inserted into the graph, ensuring that multiple edges between vertices do not occur.

(ii) Edge deletion: an edge is chosen at random and deleted from the graph. Note that, in general, this cannot cause any violations of constraints.

(iii) Edge weight change: an edge is chosen at random and its weight is changed to another random number.

Figure 2 shows a comparison of the running times of the 3 algorithms on random graphs. The graphs in question were randomly generated, had 1000 vertices and 2000 edges each, and a sequence of 10 000 edge change operations (as defined above) was applied to them. The points in the plot correspond to an average over 10 runs using randomly generated graphs. The X-axis shows the granularity of the changes: at one extreme, we apply the changes one at a time, and at the other, we apply all the changes at once and then compute the correctness of the result. Note that the delayed update feature is not used by the RSJM algorithm, which uses the fact that only one change occurs per test to look for negative cycles. As can be seen, the algorithms that use the adaptive modifications benefit greatly as the batch size is increased, and even among these, the ABF algorithm far outperforms Howard's algorithm, because the latter actually performs most of the computation required to compute the maximum cycle mean of the graph, which is far more than necessary.

Figure 2: Comparison of algorithms as batch size varies. (Log-log plot of run time in seconds versus batch size, from 1 to 10^4, for the RSJM, ABF, original BF, and Howard's algorithms, with the total number of changes held constant.)

Figure 3: Constant number of iterations at different batch sizes. (Run time in seconds versus batch size, from 1 to 10, for the same four algorithms.)

Figure 3 shows a plot of what happens when we apply1000 batches of changes to the graph, but alter the numberof changes per batch, so that the total number of changes ac-tually varies from 1000 to 100 000. As expected, RSJM takestotal time proportional to the number of changes. But theother algorithms take nearly constant time as the batch sizevaries, which provides the benefit. The reason for the almostconstant time seen here is that other bookkeeping operationsdominate over the actual computation at this stage. As the

Table 1: Relative speed of adaptive versus incremental approach for a graph of 1000 nodes, 2000 edges.

Batch size    Speedup (RSJM time/ABF time)
1             0.26×
2             0.49×
5             1.23×
10            2.31×
20            4.44×
50            10.45×
100           18.61×

Figure 4: Asymptotic behavior of the algorithms. (Plot: run time in seconds versus batch size, both axes logarithmic, showing the large batch effect on a graph with 1000 nodes and 2000 edges; curves for RSJM, ABF, and original BF.)


As mentioned previously, the adaptive algorithm is better than the incremental algorithm at handling changes in batches. Table 1 shows the relative speedup for different batch sizes on a graph of 1000 nodes and 2000 edges. Although the exact speedup may vary, it is clear that as the number of changes in a batch increases, the benefit of using the adaptive approach is considerable.

Figure 4 illustrates this for a graph with 1000 vertices and 2000 edges. We have plotted this on a log scale to capture the effect of a large variation in batch size. Because of this, note that the difference in performance between the incremental algorithm and starting from scratch is actually a factor of 3 or so at the beginning, which is considerable. Also, this figure does not show the performance of Howard's algorithm because, as can be seen from Figures 2 and 3, the ABF algorithm considerably outperforms Howard's algorithm in this context.


An important feature that can be noted from Figure 4 is the behavior of the algorithms as the number of changes between updates becomes very large. The RSJM algorithm is completely unaffected by this increase, since it has to continue processing changes one at a time. For very large numbers of changes, even when we start from scratch, we find that the total time for an update starts to increase, because now the time taken to implement the changes itself becomes a factor that dominates overall performance. In between these two extremes, we see that our adaptive algorithm provides considerable improvements for small batch sizes, but for large batches of changes, it tends towards the performance of the original Bellman-Ford algorithm for negative cycle detection.

From Figures 3 and 4, we see, as expected, that the RSJM algorithm takes time proportional to the total number of changes. Howard's algorithm also appears to take more time when the number of changes increases. Figure 2 allows us to estimate at what batch size each of the other algorithms becomes more efficient than the RSJM algorithm. Note that the scale on this figure is also logarithmic.

Another point to note with regard to these experiments is that they represent the relative behavior for graphs with 1000 vertices and 2000 edges. These numbers were chosen to obtain reasonable run times on the experiments. Similar results are obtained for other graph sizes, with a slight trend indicating that the break-even point, where our adaptive algorithm starts outperforming the incremental approach, shifts to lower batch-sizes for larger graphs.

5. APPLICATIONS

We present two applications that make extensive use of algorithms for negative cycle detection. In addition, these applications also present situations where we encounter the same graph with slight modifications, either in the edge weights (MCM computation) or in the actual addition and deletion of a small number of edges (scheduling search techniques). As a result, these provide good examples of the type of applications that would benefit from the adaptive solution to the negative cycle detection problem. As mentioned in Section 1, these problems are central to the high-level synthesis of DSP systems.

5.1. Maximum cycle mean computation

The first application we consider is the computation of the MCM of a weighted digraph. This is defined as the maximum, over all directed cycles, of the sum of the arc weights divided by the number of delay elements on the arcs. This metric plays an important role in discrete systems and embedded systems [2, 21], since it represents the greatest throughput that can be extracted from the system. Also, as mentioned in [21], there are situations where it may be desirable to recompute this measure several times on closely related graphs, for example, for the purpose of design space exploration. As specific examples, [6] proposes an algorithm for dataflow graph partitioning where the repeated computation of the MCM plays a key role, and [22] discusses the utility of frequent MCM computation for synchronization optimization in embedded multiprocessors. Therefore, efficient algorithms for this problem can make it reasonable to consider using such solutions instead of the simpler heuristics that are otherwise necessary. Although several results such as [23, 24] provide polynomial time algorithms for the problem of MCM computation, the first extensive study of algorithmic alternatives for it has been undertaken by Dasdan et al. [2]. They concluded that the best existing algorithm in practice for this problem appears to be Howard's algorithm, which, unfortunately, does not have a known polynomial bound on its running time.

To model this application, the edge weights on our graph are obtained from the equation

weight(u → v) = delay(e) × P − exec_time(u),    (5)

where weight(e) refers to the weight of the edge e : u → v, delay(e) refers to the number of delay elements (flip-flops) on the edge, exec_time(u) is the propagation delay of the circuit element corresponding to the source vertex u, and P is the desired clock period for which we are testing the system. In other words, if the graph with weights as defined above has no negative cycles, then P is a feasible clock period for the system. We can then perform a binary search in order to compute P to any precision we require. This algorithm is attributed to Lawler [9]. Our contribution here is to apply the adaptive negative cycle detection techniques to this algorithm and analyze the improved algorithm that is obtained as a result.
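As a concrete illustration of Lawler's scheme, the following is a minimal C++ sketch. The Edge record, the function names, and the use of a plain Bellman-Ford test as the feasibility check are illustrative assumptions; the paper's implementation substitutes the adaptive ABF detector for the plain test.

#include <vector>

/* Hypothetical edge record for the constraint graph of (5). */
struct Edge { int u, v, delay; double exec_time_u; };

/* Reference negative cycle test (plain Bellman-Ford); the adaptive ABF
   detector would take its place in practice. */
static bool has_negative_cycle(const std::vector<Edge>& g, int n, double P)
{
    std::vector<double> dist(n, 0.0);  /* virtual source: distance 0 to all */
    for (int pass = 0; pass < n; ++pass) {
        bool relaxed = false;
        for (const Edge& e : g) {
            double w = e.delay * P - e.exec_time_u;  /* weight from (5) */
            if (dist[e.u] + w < dist[e.v]) {
                dist[e.v] = dist[e.u] + w;
                relaxed = true;
            }
        }
        if (!relaxed)
            return false;  /* converged: no negative cycle, P is feasible */
    }
    return true;  /* still relaxing after n passes: negative cycle exists */
}

/* Lawler-style binary search for the smallest feasible clock period. */
double min_clock_period(const std::vector<Edge>& g, int n,
                        double lo, double hi, double eps)
{
    while (hi - lo > eps) {           /* each halving costs one cycle test */
        double mid = 0.5 * (lo + hi);
        if (has_negative_cycle(g, n, mid))
            lo = mid;                 /* mid infeasible: need a larger period */
        else
            hi = mid;                 /* mid feasible: try a smaller period */
    }
    return hi;                        /* smallest period known to be feasible */
}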

5.1.1 Experimental setup

For an experimental study, we build on the work by Dasdan et al. [2], where the authors have conducted an extensive study of algorithms for this problem. They conclude that Howard's algorithm [10] appears to be the fastest experimentally, even though no theoretical time bounds indicate this. As will be seen, our algorithm performs almost as well as Howard's algorithm on several useful-sized graphs, and especially on the circuits of the ISCAS 89/93 benchmarks, where our algorithm typically performs better.

For comparison purposes, we implemented our algorithm in the C programming language and compared it against the implementation provided by the authors of [10]. Although the authors do not claim their implementation is the fastest possible, it appears to be a very efficient implementation, and we could not find any obvious ways of improving it. As we mentioned in the previous section, the implementation we used incorporates the improvements proposed by Dasdan et al. [2]. The experiments were run on a Sun Ultra SPARC-10 (333 MHz processor, 128 MB memory). This machine would classify as a medium-range workstation under present conditions.

It is clear that the best performance bound that can be placed on the algorithm as it stands is O(|V||E| log T), where T is the maximum value of P that we examine in the search procedure, and |V| and |E| are, respectively, the size of the input graph in number of vertices and edges. However, our experiments show that it performs significantly faster than would be expected from this bound.


One point to note is that since we are doing a binary search on T, we are forced to set a limit on the precision to which we compute our answer. This precision in turn depends on the maximum value of the edge weights, as well as the actual precision desired in the application itself. Since these depend on the application, we have had to choose values for them. We have used a random graph generator that generates integer weights for the edges in the range [0–10 000]. For this range of weights, it could be argued that integer precision would be sufficient. However, since the maximum cycle mean is a ratio, it is not restricted to integer values. We have therefore conservatively chosen a precision of 0.001 for the binary search (i.e., 10^-7 times the maximum edge weight). Increasing the precision by a factor of 2 requires one more run of the negative cycle detection algorithm, which implies a proportionate increase in the total time taken for computation of the MCM; for example, searching the range [0–10 000] down to a precision of 0.001 takes about log2(10^7) ≈ 24 feasibility tests.

With regard to the ISCAS benchmarks, note that there is a slight ambiguity in translating the netlists into graphs. This arises because a D-type flip-flop can either be treated as a single edge with a delay, with the fanout proceeding from the sink of this edge, or as k separate edges with unit delay emanating from the source vertex. In the former treatment, it makes more sense to talk about the |D|/|V| ratio (|D| being the number of D flip-flops), as opposed to the |D|/|E| ratio that we use in the experiments with random graphs. However, the difference between the two treatments is not significant and can be safely ignored.

We also conducted experiments where we vary the number of edges with delays on them. For this, we need to exercise care, since we may introduce cycles without delays on them, which are fundamentally infeasible and do not have a maximum cycle mean. To avoid this, we follow the policy of treating edges with delays as "back-edges" in an otherwise acyclic graph [15]. This view is inspired by the structure of circuits, where a delay element usually figures in the feedback portion of the system. Unfortunately, one effect of this is that when we have a low number of delay edges, the resulting graph tends to have an asymmetric structure: it is almost acyclic with only a few edges in the reverse "direction." It is not clear how to get around this problem in a fashion that does not destroy the symmetry of the graph, since this requires solving the feedback arc set problem, which is NP-hard [25].

One effect of this is the way it impacts the performance of the Bellman-Ford algorithm. When the number of edges with delays is small, there are several negative weight edges, which means that the standard Bellman-Ford algorithm spends large amounts of time trying to compute shortest paths initially. The incremental approach, however, is able to avoid this excess computation for large values of T, which results in its performance being considerably faster when the number of delays is small.

Intuitively, therefore, for the above situation, we would expect our algorithm to perform better. This is because, for the MCM problem, a change in the value of P for which we are testing the system will cause changes in the weights of those edges which have delays on them. If these are fewer, then we would expect that fewer operations would be required overall when we retain information across iterations. This is borne out by the experiments as discussed in Section 5.1.2.

Figure 5: Comparison of algorithms for 10 000 vertices, 20 000 edges: the number of feedback edges (with delays) is varied as a proportion of the total number of edges. (Plot: run time in seconds versus feedback edge ratio from 0 to 1; curves for the ABF-based MCM computation, the MCM computation using the normal BF routine, and Howard's algorithm.)


Our experiments focus more on the kinds of graphs that appear to represent real circuits. By this we mean graphs for which the average out-degree of a vertex (number of edges divided by number of vertices) and the relative number of edges with delays on them are similar to those found in real circuits. We have used the ISCAS benchmarks as a good representative sample of real circuits, and we can see that they show remarkable similarity in the parameters we have described: the average out-degree of a vertex is close to, and a little less than, 2, while on average about one-tenth or fewer of the edges have delays on them. An intuitive explanation for the former observation is that most real circuits are usually built up of a collection of simpler systems, which predominantly have small numbers of inputs and outputs. For example, logic gates typically have 2 inputs and 1 output, as do elements such as adders and multipliers. More complex elements like multiplexers and encoders are relatively rare, and even their effect is somewhat offset by single-input single-output units like NOT gates and filters.

5.1.2 Experimental results

We now present the results of the experiments on random graphs, with different parameters of the graph being varied.

We first consider the behavior of the algorithms for random graphs consisting of 10 000 vertices and 20 000 edges, when the feedback edge ratio (ratio of edges with nonzero delay to total number of edges) is varied from 0 to 1 in increments of 0.1. The resulting plot is shown in Figure 5. As discussed in Section 5.1.1, for small values of this ratio, the graph is nearly acyclic, and almost all edges have negative weights. As a result, the normal Bellman-Ford algorithm performs a large number of computations that increase its running time. The ABF-based algorithm is able to avoid this overhead due to its property of retaining information across runs, and so it performs significantly better for small values of the feedback edge ratio. The ABF-based algorithm and Howard's algorithm perform almost identically in this experiment. The points on the plot represent an average over 10 random graphs each.


Figure 6: Performance of the algorithms as graph size varies: all edges have delays (feedback edges) and the number of edges = twice the number of vertices. (Plot: run time in seconds versus number of vertices up to 2 × 10^5; curves for MCM using ABF, MCM using original BF, and Howard's algorithm.)


Figure 6 shows the effect of varying the number of vertices. The average degree of the graph is kept constant, so that there is an average of 2 edges per vertex, and the feedback edge ratio is kept constant at 1 (all edges have delays). The reason for the choice of average degree was explained in Section 5.1.1. Figure 7 shows the same experiment, but this time with a feedback edge ratio of 0.1. We have limited the displayed portion of the Y-axis since the values for the MCM computation using the original Bellman-Ford routine rise as high as 10 times those of the others and would drown them out otherwise.

These plots reveal an interesting point: as the size of the graph increases, Howard's algorithm performs less well than the MCM computation using the ABF algorithm. This indicates that for real circuits, the ABF-based algorithm may actually be a better choice than Howard's algorithm. This is borne out by the results on the ISCAS benchmarks.

Figures 8 and 9 show a study of what happens as the edge density of the graph is varied: for this, we have kept the number of edges constant at 20 000, while the number of vertices varies from 1000 to 17 500. This means a variation in edge density (ratio of the number of edges to the number of vertices) from 1.15 to 20. In both these figures, we see that the MCM computation using ABF performs especially well at low densities (sparse graphs), where it does considerably better than Howard's algorithm and the normal MCM computation using ordinary negative cycle detection. In addition, the point where the ABF-based algorithm starts performing better appears to be at around an edge density of 2, which is also seen in Figure 5.

Figure 7: Performance of the algorithms as graph size varies: proportion of edges with delays = 0.1 and the number of edges = twice the number of vertices (Y-axis limited to show detail). (Plot: run time in seconds versus number of nodes up to 2 × 10^5; curves for MCM using ABF, MCM using original BF, and Howard's algorithm.)

Figure 8: Performance of the algorithms as graph edge density varies: all edges have delays (feedback edges) and the number of edges = 20 000. (Plot: run time in seconds versus number of vertices from 0 to 18 000; curves for MCM using ABF, MCM using original BF, and Howard's algorithm.)



Figure 9: Performance of the algorithms as graph edge density varies: proportion of edges with delays = 0.1 and the number of edges = 20 000. (Plot: run time in seconds versus number of vertices from 0 to 18 000; curves for MCM using ABF, MCM using original BF, and Howard's algorithm.)

We note the following features from the experiments:

(i) If all edges have unit delay, the MCM algorithm that uses our adaptive negative cycle detection provides some benefit, but less than in the case where few edges have delays.

(ii) When we vary the number of feedback edges (edges with delays), the benefit of the modifications becomes very considerable at low feedback ratios, doing better than Howard's algorithm for low edge densities.

(iii) In the ISCAS benchmarks, we can see that all of the circuits have |E|/|V| < 2 and |D|/|V| < 0.1 (|D| is the number of flip-flops, |V| is the total number of circuit elements, and |E| is the number of edges). In this range of parameters, our algorithm performs very well, even better than Howard's algorithm in several cases (also see Table 2 for our results on the ISCAS benchmarks).

Table 2 shows the results obtained when we used the different algorithms to compute MCMs for the circuits from the ISCAS 89/93 benchmark set. One point to note here is that the ISCAS circuits are not true HLS benchmarks; they were originally designed with logic circuits in mind, and as such, the normal assumption would be that all registers (flip-flops) in the system are triggered by the same clock. In order to use them for our testing, however, we have relaxed this assumption and allowed each flip-flop to be triggered on any phase; in particular, the phases that are computed by the MCM computation algorithm are such that the overall system speed is maximized. These benchmark circuits are still very important in the area of HLS, because real DSP circuits also show similar structure (sparseness and density of delay elements), and an important observation we can make from the experiments is that the structure of the graph is very relevant to the performance of the various algorithms in the MCM computation.

Table 2: Run time for MCM computation for the largest ISCAS 89/93 benchmarks (times in seconds).

Benchmark   |E|/|V|   |D|/|V|   Orig. BF MCM   ABF MCM   Howard's algo.
s38417      1.416     0.069     2.71           0.29      0.66
s38584      1.665     0.069     2.66           0.63      0.59
s35932      1.701     0.097     1.79           0.37      0.09
s15850      1.380     0.057     1.47           0.18      0.36
s13207      1.382     0.077     0.73           0.12      0.35
s9234       1.408     0.039     0.57           0.06      0.11
s6669       1.657     0.070     0.74           0.07      0.04
s4863       1.688     0.042     0.27           0.04      0.03
s3330       1.541     0.067     0.11           0.02      0.01
s1423       1.662     0.099     0.07           0.01      0.01


As can be seen in Table 2, Lawler's algorithm does reasonably well at computing the MCM. However, when we use the adaptive negative cycle detection in place of the normal negative cycle detection technique, there is an increase in speed by a factor of 5 to 10 in most cases. This increase in speed is in fact sufficient to make Lawler's algorithm with this implementation up to twice as fast as Howard's algorithm, which was otherwise considered the fastest algorithm in practice for this problem.

5.2. Search techniques for scheduling

We demonstrate another application of our technique: efficient searching of schedules for IDFGs. The basic idea is that for scheduling an IDFG, we need to (a) assign vertices to processors and (b) assign relative positions to the vertices within each processor (for resource sharing). Once these two aspects are fixed, the schedule for a given throughput constraint is determined by finding a feasible solution to the constraint equations, which we do using the ABF algorithm. This idea that the ordering can be directly used to compute schedule times has been used previously; for example, see [26]. Since the search process involves repeatedly checking the feasibility of many similar constraint systems, the advantages of the adaptive negative cycle detection come into play.

The approach we have taken for the schedule search is as follows:

(i) start with each vertex on its own processor, and find a feasible solution on the fastest possible processor;

(ii) examine each vertex in turn, and try to find a place for it on another processor (resource sharing). In doing so, we are making a small number of changes to the constraint system, and need to recompute a feasible solution;

(iii) in choosing the new position, choose one that has minimum power (or area, or whatever cost we want to optimize);

(iv) additional moves that can be made include inserting a new processor type and moving as many vertices onto it as possible, moving vertices in groups from one processor to another, and so forth;



(v) the technique also lends itself very well to application in schemes using evolutionary improvement [27];

(vi) in the present implementation, to choose among various equivalent implementations at a given stage, we use a weight based on giving greater importance to implementations that result in lower overall slack on the cycles in the system. (The slack of a cycle here refers to the difference between the total delay afforded by the registers (number of delay elements in the cycle times the clock period) and the sum of the execution times of the vertices on the cycle; it is useful since a large slack can be taken as an indication of under-utilization of resources. A small sketch of this computation follows the list.)
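To make the slack measure in item (vi) concrete, here is a minimal C++ sketch; the function name and the representation of a cycle by per-edge delay counts and per-vertex execution times are hypothetical.

#include <vector>

/* Hypothetical helper: slack of a cycle, given the delay count of each
   edge on the cycle, the execution time of each vertex on the cycle,
   and the clock period P, as defined in item (vi) above. */
double cycle_slack(const std::vector<int>& edge_delays,
                   const std::vector<double>& vertex_exec_times,
                   double P)
{
    double register_delay = 0.0, total_exec = 0.0;
    for (int d : edge_delays)
        register_delay += d * P;   /* delay afforded by the registers */
    for (double t : vertex_exec_times)
        total_exec += t;           /* work that must fit on the cycle */
    /* A large slack suggests under-utilized resources on this cycle. */
    return register_delay - total_exec;
}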

Each such move or modification that we make to the graph can be treated as a set of edge changes in the precedence/processor constraint graph, and a feasible schedule is found if the system does not have negative cycles. In addition, the dist(v) values that are obtained from applying the algorithm directly give us the starting times that will meet the schedule requirements.
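As an illustration of how a feasible solution to such a system of difference constraints yields start times, here is a minimal C++ sketch using a standard Bellman-Ford formulation (see, e.g., [18]); the names and types are assumptions, and the plain relaxation loop stands in for the adaptive ABF algorithm, which performs the same role but reuses work between similar constraint systems.

#include <optional>
#include <vector>

/* One difference constraint start[v] - start[u] <= c, encoded as an edge
   u -> v of weight c; the actual encoding of precedence and processor
   constraints follows the construction described in the text. */
struct Constraint { int u, v; double c; };

/* Bellman-Ford from a virtual source (all distances start at 0). */
std::optional<std::vector<double>>
solve_constraints(const std::vector<Constraint>& cs, int n)
{
    std::vector<double> dist(n, 0.0);
    for (int pass = 0; pass < n; ++pass) {
        bool relaxed = false;
        for (const Constraint& e : cs) {
            if (dist[e.u] + e.c < dist[e.v]) {
                dist[e.v] = dist[e.u] + e.c;
                relaxed = true;
            }
        }
        if (!relaxed)
            return dist;   /* converged: dist[v] are valid start times */
    }
    return std::nullopt;   /* negative cycle: no feasible schedule exists */
}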

We have applied this technique to attack the multiple-voltage scheduling problem addressed in [28]. The problem here is to find a schedule for the given DFG that minimizes the overall power consumption, subject to fixed constraints on the iteration period bound and on the total number of resources available. For this example, we consider three resources: adders that operate at 5 V, adders that operate at 3.3 V, and multipliers that operate at 5 V. For the elliptic filter, multipliers operate in 2 time units, while for the FIR filter, they operate in 1 time unit. The 5 V adders operate in 1 time unit, while the 3.3 V adders always operate in 2 time units. It is clear that the power savings are obtained by scheduling as many additions as possible on 3.3 V adders instead of 5 V adders. We have used only the basic resource types mentioned in Table 3 to compare our results with those in [28]. However, there is no inherent limit imposed by the algorithm itself on the number of different kinds of resources that we can consider.

In tackling this problem, we have used only the most basic move, namely, moving vertices onto another existing processor. Already, the results match and even outperform those obtained in [28]. In addition, the method has the benefit that it can handle any number of voltages/processors, and can also easily be extended to other problems, such as homogeneous-processor scheduling [15]. Table 3 shows the power savings that were obtained using this technique. The "S and R" column indicates the power savings (assuming 25 units for 5 V devices and 10.89 units for 3.3 V devices) obtained by [28], while the "ABF" column refers to the results obtained using our algorithm (where the ABF algorithm is used to test the feasibility of the system after each move, as described above). The overall timing constraint T is the iteration period we are aiming for.

Table 3 shows some interesting features. The iterative improvement based on the ABF algorithm (column marked ABF) produced results with significantly higher power savings than the results presented in [28].

Table 3: Comparison between the ABF-based search and the algorithm of Sarrafzadeh and Raje [28] (×: failed to schedule; —: not available).

Example        Resources (5 V+, 3.3 V+, 5 V*)   T    Power saved, S and R   Power saved, ABF
5th-order      {2, 2, 2}                        25   31.54%                 34.86%
ellip. filt.   {2, 1, 2}                        25   18.26%                 16.60%
               {2, 2, 2}                        22   23.24%                 26.56%
               {2, 1, 2}                        21   13.28%                 14.94%
FIR filt.      {1, 2, 1}                        15   29.45%                 ×
               {1, 2, 2}                        15   —                      34.35%
               {1, 2, 1}                        16   —                      36.81%
               {1, 2, 2}                        10   17.18%                 24.54%

One important reason contributing to this could be that the iterative improvement algorithm makes full use of the iterative nature of the graphs, and produces schedules that make good use of the available inter-iteration parallelism. On the other hand, we find that for one of the configurations, the ABF-based algorithm is not able to find any valid schedule. This is because the simple nature of the algorithm occasionally results in it getting stuck in local minima, with the result that it is unable to find a valid schedule even when one exists.

Several variations on this theme are possible; the search scheme could be used for other criteria, such as the case where the architecture needs to be chosen (not fixed in advance), and modifications such as small amounts of randomization could be used to prevent the algorithm from getting stuck in local minima. This flexibility, combined with the speed improvements afforded by the improved adaptive negative cycle detection, can allow this method to form the core of a large class of scheduling techniques.

6. CONCLUSIONS

The problem of negative cycle detection is considered in the context of HLS for DSP systems. It was shown that important problems such as performance analysis and design space exploration often result in the construction of "dynamic" graphs, where it is necessary to repeatedly perform negative cycle detection on variants of the original graph.

We have introduced an adaptive approach (the ABF algorithm) to negative cycle detection in dynamically changing graphs. Specifically, we have developed an enhancement to Tarjan's algorithm for detecting negative cycles in static graphs. This enhancement yields a powerful algorithm for dynamic graphs that outperforms previously available methods for addressing the scenario where multiple changes are made to the graph between updates. Our technique explicitly addresses the common, practical scenario in which negative cycle detection must be performed periodically, after intervals in which a small number of changes are made to the graph. We have shown by experiments that for reasonably sized graphs (10 000 vertices and 20 000 edges) our algorithm outperforms the incremental algorithm (one change processed at a time) described in [4] even for changes made in groups of as few as 4–5 at a time.



As our original interest in the negative cycle detection problem arose from its application to the problems described above in HLS, we have implemented some schemes that make use of the adaptive approach to solve those problems. We have shown how our adaptive approach to negative cycle detection can be exploited to compute the maximum cycle mean of a weighted digraph, which is a relevant metric for determining the throughput of DSP system implementations. We have compared our ABF technique and ABF-based MCM computation technique against the best known related work in the literature, and have observed favorable performance. Specifically, the new technique provides better performance than Howard's algorithm for sparse graphs with relatively few edges that have delays.

Since computing power is cheaply available now, it is increasingly worthwhile to employ extensive search techniques for solving NP-hard analysis and design problems such as scheduling. The availability of an efficient adaptive negative cycle detection algorithm can make this process much more efficient in many application contexts. We have demonstrated this concretely by employing our ABF algorithm within the framework of a search strategy for multiple voltage scheduling.

ACKNOWLEDGMENTS

This research was supported in part by the US National Science Foundation (NSF) Grant #9734275, NSF NYI Award MIP9457397, and the Advanced Sensors Collaborative Technology Alliance.

REFERENCES

[1] R. Reiter, "Scheduling parallel computations," Journal of the ACM, vol. 15, no. 4, pp. 590–599, 1968.

[2] A. Dasdan, S. S. Irani, and R. K. Gupta, "Efficient algorithms for optimum cycle mean and optimum cost to time ratio problems," in 36th Design Automation Conference, pp. 37–42, New Orleans, La, USA, ACM/IEEE, June 1999.

[3] B. Cherkassky and A. V. Goldberg, "Negative cycle detection algorithms," Tech. Rep. tr-96-029, NEC Research Institute, March 1996.

[4] G. Ramalingam, J. Song, L. Joskowicz, and R. E. Miller, "Solving systems of difference constraints incrementally," Algorithmica, vol. 23, no. 3, pp. 261–275, 1999.

[5] D. Frigioni, A. Marchetti-Spaccamela, and U. Nanni, "Fully dynamic shortest paths and negative cycle detection on digraphs with arbitrary arc weights," in ESA '98, vol. 1461 of Lecture Notes in Computer Science, pp. 320–331, Springer, Venice, Italy, August 1998.

[6] L.-T. Liu, M. Shih, J. Lillis, and C.-K. Cheng, "Data-flow partitioning with clock period and latency constraints," IEEE Trans. on Circuits and Systems I: Fundamental Theory and Applications, vol. 44, no. 3, 1997.

[7] G. Ramalingam, Bounded Incremental Computation, Ph.D. thesis, University of Wisconsin, Madison, Wis, USA, August 1993; revised version published by Springer-Verlag (1996) as vol. 1089 of Lecture Notes in Computer Science.

[8] B. Alpern, R. Hoover, B. K. Rosen, P. F. Sweeney, and F. K. Zadeck, "Incremental evaluation of computational circuits," in Proc. 1st ACM-SIAM Symposium on Discrete Algorithms, pp. 32–42, San Francisco, Calif, USA, January 1990.

[9] E. Lawler, Combinatorial Optimization: Networks and Matroids, Holt, Rinehart and Winston, New York, NY, USA, 1976.

[10] J. Cochet-Terrasson, G. Cohen, S. Gaubert, M. McGettrick, and J.-P. Quadrat, "Numerical computation of spectral elements in max-plus algebra," in Proc. IFAC Conf. on Syst. Structure and Control, Nantes, France, July 1998.

[11] N. Chandrachoodan, S. S. Bhattacharyya, and K. J. R. Liu, "Adaptive negative cycle detection in dynamic graphs," in Proc. International Symposium on Circuits and Systems, vol. V, pp. 163–166, Sydney, Australia, May 2001.

[12] G. Ramalingam and T. Reps, "An incremental algorithm for a generalization of the shortest-paths problem," Journal of Algorithms, vol. 21, no. 2, pp. 267–305, 1996.

[13] D. Frigioni, M. Ioffreda, U. Nanni, and G. Pasqualone, "Experimental analysis of dynamic algorithms for the single source shortest paths problem," in Proc. Workshop on Algorithm Engineering, pp. 54–63, Ca' Dolfin, Venice, Italy, September 1997.

[14] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, Network Flows, Prentice-Hall, Upper Saddle River, NJ, USA, 1993.

[15] S. M. H. de Groot, S. H. Gerez, and O. E. Herrmann, "Range-chart-guided iterative data-flow graph scheduling," IEEE Trans. on Circuits and Systems I: Fundamental Theory and Applications, vol. 39, no. 5, pp. 351–364, 1992.

[16] K. K. Parhi and D. G. Messerschmitt, "Static rate-optimal scheduling of iterative data-flow programs via optimum unfolding," IEEE Trans. on Computers, vol. 40, no. 2, pp. 178–195, 1991.

[17] R. E. Tarjan, "Shortest paths," Tech. Rep., AT&T Bell Laboratories, Murray Hill, New Jersey, USA, 1981.

[18] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, MIT Press, Cambridge, Mass, USA, 1990.

[19] A. Dasdan, S. S. Irani, and R. K. Gupta, "An experimental study of minimum mean cycle algorithms," Tech. Rep. UCI-ICS #98-32, University of California, Irvine, 1998.

[20] K. Mehlhorn and S. Naher, "LEDA: A platform for combinatorial and geometric computing," Communications of the ACM, vol. 38, no. 1, pp. 96–102, 1995.

[21] K. Ito and K. K. Parhi, "Determining the minimum iteration period of an algorithm," Journal of VLSI Signal Processing, vol. 11, no. 3, pp. 229–244, 1995.

[22] S. S. Bhattacharyya, S. Sriram, and E. A. Lee, "Resynchronization for multiprocessor DSP systems," IEEE Trans. on Circuits and Systems I: Fundamental Theory and Applications, vol. 47, no. 11, pp. 1597–1609, 2000.

[23] D. Y. Chao and D. T. Wang, "Iteration bounds of single rate dataflow graphs for concurrent processing," IEEE Trans. on Circuits and Systems I: Fundamental Theory and Applications, vol. 40, no. 9, pp. 629–634, 1993.

[24] S. H. Gerez, S. M. H. de Groot, and O. E. Herrmann, "A polynomial time algorithm for computation of the iteration period bound in recursive dataflow graphs," IEEE Trans. on Circuits and Systems I: Fundamental Theory and Applications, vol. 39, no. 1, pp. 49–52, 1992.

[25] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Company, New York, NY, USA, 1979.

[26] D. J. Wang and Y. H. Hu, "Fully static multiprocessor array realizability criteria for real-time recurrent DSP applications," IEEE Trans. Signal Processing, vol. 42, no. 5, pp. 1288–1292, 1994.


[27] T. Back, U. Hammel, and H.-P. Schwefel, "Evolutionary computation: Comments on the history and current state," IEEE Trans. Evolutionary Computation, vol. 1, no. 1, pp. 3–17, 1997.

[28] M. Sarrafzadeh and S. Raje, "Scheduling with multiple voltages under resource constraints," in Proc. 1999 International Symposium on Circuits and Systems, pp. 350–353, Miami, Fla, USA, 30 May–2 June 1999.

Nitin Chandrachoodan was born on August 11, 1975 in Madras, India. He received the B.Tech. degree in electronics and communications engineering from the Indian Institute of Technology, Madras, in 1996, and the M.S. degree in electrical engineering from the University of Maryland at College Park in 1998. He is currently a Ph.D. candidate at the University of Maryland. His research concerns analysis and representation techniques for system level synthesis of DSP dataflow graphs.

Shuvra S. Bhattacharyya received the B.S. degree from the University of Wisconsin at Madison, and the Ph.D. degree from the University of California at Berkeley. He is an Associate Professor in the Department of Electrical and Computer Engineering, and the Institute for Advanced Computer Studies (UMIACS) at the University of Maryland, College Park. He is also an Affiliate Associate Professor in the Department of Computer Science. The coauthor of two books and the author or coauthor of more than 50 refereed technical articles, Dr. Bhattacharyya is a recipient of the NSF Career Award. His research interests center around architectures and computer-aided design for embedded systems, with emphasis on hardware/software codesign for signal, image, and video processing. Dr. Bhattacharyya has held industrial positions as a Researcher at Hitachi, and as a Compiler Developer at Kuck & Associates.

K. J. Ray Liu received the B.S. degree from the National Taiwan University, and the Ph.D. degree from UCLA, both in electrical engineering. He is Professor at the Electrical and Computer Engineering Department of the University of Maryland, College Park. His research interests span broad aspects of signal processing architectures; multimedia communications and signal processing; wireless communications and networking; information security; and bioinformatics, in which he has published over 230 refereed papers, of which over 70 are in archival journals. Dr. Liu is the recipient of numerous awards including the 1994 National Science Foundation Young Investigator Award, the IEEE Signal Processing Society's 1993 Senior Award, and the IEEE 50th Vehicular Technology Conference Best Paper Award, Amsterdam, 1999. He also received the George Corcoran Award in 1994 for outstanding contributions to electrical engineering education and the Outstanding Systems Engineering Faculty Award in 1996 in recognition of outstanding contributions in interdisciplinary research, both from the University of Maryland. Dr. Liu is Editor-in-Chief of EURASIP Journal on Applied Signal Processing, and has been an Associate Editor of IEEE Transactions on Signal Processing, a Guest Editor of special issues on Multimedia Signal Processing of Proceedings of the IEEE, a Guest Editor of a special issue on Signal Processing for Wireless Communications of IEEE Journal on Selected Areas in Communications, a Guest Editor of a special issue on Multimedia Communications over Networks of IEEE Signal Processing Magazine, a Guest Editor of a special issue on Multimedia over IP of IEEE Trans. on Multimedia, and an editor of Journal of VLSI Signal Processing Systems.


EURASIP Journal on Applied Signal Processing 2002:9, 908–925
© 2002 Hindawi Publishing Corporation

Design and DSP Implementation of Fixed-Point Systems

Martin Coors
Institute for Integrated Signal Processing Systems, Aachen University of Technology, 52056 Aachen, Germany
Email: [email protected]

Holger Keding
Institute for Integrated Signal Processing Systems, Aachen University of Technology, 52056 Aachen, Germany
Email: [email protected]

Olaf Luthje
Institute for Integrated Signal Processing Systems, Aachen University of Technology, 52056 Aachen, Germany
Email: [email protected]

Heinrich Meyr
Institute for Integrated Signal Processing Systems, Aachen University of Technology, 52056 Aachen, Germany
Email: [email protected]

Received 31 August 2001

This article is an introduction to the FRIDGE design environment, which supports the design and DSP implementation of fixed-point digital signal processing systems. We present the tool-supported transformation of signal processing algorithms coded in floating-point ANSI C to a fixed-point representation in SystemC. We introduce the novel approach to control and data flow analysis, which is necessary for the transformation. The design environment enables fast bit-true simulation by mapping the fixed-point algorithm to integral data types of the host machine. A speedup by a factor of 20 to 400 can be achieved compared to C++-library-based bit-true simulation. FRIDGE also provides a direct link to DSP implementation by processor-specific C code generation and advanced code optimization.

Keywords and phrases: fixed-point design, design methodology, data flow analysis, compiled simulation, code optimization.

1. INTRODUCTION

Digital system design is characterized by ever-increasing complexity that has to be implemented within reduced time, resulting in minimum costs and short time-to-market. This requires a seamless design flow that allows the execution of the design steps at the highest suitable level of abstraction.

For most digital systems, the design has to result in a fixed-point implementation, either in HW or SW. This is due to the fact that these systems are sensitive to power consumption, chip size, throughput, and price-per-device. Fixed-point realizations outperform floating-point realizations by far with regard to these criteria.

A typical fixed-point design flow is depicted in Figure 1. Algorithm design starts from a floating-point description that is analyzed by means of simulation without taking the quantization effects into account. This abstraction from all implementation effects allows an exploration of the algorithm space, for example, the evaluation of different digital receiver structures. This exploration is well supported by a variety of commercial block-diagram oriented system-level design tools [1, 2, 3]. The modeling efficiency on the floating-point level is high and the floating-point models offer a maximum degree of reusability.

In a next step towards system implementation, a transformation to a bit-true representation of the system is necessary, that is, assigning a fixed word length and a fixed exponent to every operand. This process is quite tedious and error-prone if done manually: often more than 50% of the implementation time is spent on the algorithmic transformation [4] to the fixed-point level for complex designs once the floating-point model has been specified.

The major reasons for this bottleneck are as follows:

(1) There is no unique transformation from floating-point to fixed-point.

(a) Different HW and SW targets put different constraints on the fixed-point specification.

(b) Optimizations for different design criteria, like throughput, chip size, memory size, or accuracy, are in general mutually exclusive goals and result in a complex design space, as sketched in Figure 2.


Figure 1: Fixed-point design process. (Flow diagram: a floating-point description is evaluated ("ok?"), transformed by quantization into a fixed-point description, evaluated again, and finally coded for SW/HW; the stages correspond to design space exploration, evaluation of the bit-true behavior, and implementation, connected by description-level and algorithmic transformations.)

Figure 2: Fixed-point design space. (Trade-off axes: program memory/chip size, quantization noise, and throughput.)

Furthermore, targets with a given datapath, for example, DSPs, put different constraints on the quantization than ASICs, where the datapaths are flexible.

(c) The quantization is generally highly dependent on the application, that is, on the applied stimuli.

(2) Quantization is a nonlinear process. Analytical models based on signal theory are only applicable for systems with a low complexity [5]. An exploration of the fixed-point design space with respect to quantization noise, performance, and operand word lengths cannot be done without extensive system simulation.

(3) Some algorithms are difficult to implement in fixed-point due to high signal dynamics or sensitivity to quantization noise. Thus algorithmic alternatives need to be employed.

Finally, the quantized system is implemented, either in hardware or in software on a programmable DSP. The implementation needs to be optimized with respect to chip area, memory consumption, throughput, and power consumption. Here the bit-true system-level model serves as a "golden" reference for the target implementation, which yields bit-by-bit the same results.

To increase the designer's efficiency, software tool support for fixed-point design is necessary. Ideally the design environment would have the following features:

(1) A modeling language supporting generic fixed-point data types to model the fixed-point behavior of the system. It will also provide a means of data monitoring of variables and operands during simulation, for example, range, mean, and variance.

(2) A semiautomatic transformation from floating-point to a bit-true representation. The designer can bring in his knowledge about the system and he has full control over the transformation. The tool will accept a set of constraints specified by the designer to model the characteristics of the target hardware.

(3) The ability to perform bit-true simulation with a simulation speed close to floating-point simulation.

(4) A seamless design flow down to system implementation, generating optimized input for DSP compilers.

These requirements have been the motivation for the Fixed-point pRogrammIng and Design Environment (FRIDGE) [6, 7, 8], an interactive design environment for the specification, simulation, and implementation of fixed-point systems.

In this article we describe the principles and elements of FRIDGE and outline the seamless design flow as it becomes possible with this design environment. FRIDGE relies on five main concepts, which are briefly introduced in the following.

1.1. Fixed-point modeling language

DSP system design is frequently done on a PC or a workstation utilizing a C/C++-based system-level design environment. For efficient modeling of finite word length effects, language extensions implementing generic fixed-point data types are necessary. ANSI C does not offer such data types and hence fixed-point modeling using pure ANSI C becomes a very tedious and error-prone task.

Fixed-point language extensions implemented as libraries in C++ [9, 10, 11] offer a high modeling efficiency. They supply generic fixed-point data types and various casting modes for overflow and quantization handling. The simulation speed of these libraries, on the other hand, is rather poor. Some of these libraries also offer data monitoring capabilities during simulation time.

In the FRIDGE design environment, the SystemC fixed-point data types are used for fixed-point modeling and simulation. A more detailed description of the SystemC fixed-point data types is given in Section 3.

1.2. Interpolative transformation

A central component of the FRIDGE design environment is the interpolative transformation from a hybrid description into a fully bit-true representation. The interpolative transformation, which is presented in detail in Section 4, uses analytical range propagation to determine operand word lengths.



1.3. Data flow analysis

During the development of the FRIDGE design environment, we have identified a need for accurate data flow analysis. The published approaches for static and dynamic program analysis did not match the requirements of the design environment, thus we have developed a novel approach for control and data flow analysis, which is presented in Section 5.

1.4. Fast bit-true simulation

Existing C++-based simulation libraries model the fixed-point operands as objects and make extensive use of operator overloading and container data types. Also, for ease of use, many decisions are made during run time. These mechanisms increase the execution time of fixed-point simulations by one to two orders of magnitude compared to floating-point arithmetic. This makes the simulation run time a major bottleneck during the fixed-point design process.

In Section 7 various approaches for fixed-point simulation are presented and a methodology for fast bit-true simulation by mapping fixed-point algorithms in SystemC to an integer-based ANSI C algorithm is introduced.

1.5. DSP target mapping

The final step in a float-to-fixed design flow is the implementation of the DSP system, either in hardware or in software. As a case study for targeting a high performance DSP, we have developed a FRIDGE back end which addresses the Texas Instruments TMS320C62x fixed-point DSP processor and its C compiler. The back end generates target-specific integer C code which exploits the features of the processor and the compiler to achieve a high efficiency of the compiled code. In Section 9 the FRIDGE C62x back end and the optimization strategies are presented.

2. THE FRIDGE DESIGN FLOW

The FRIDGE design flow starts from a floating-point algorithm in ANSI C. As illustrated in Figure 3, the designer then annotates single operands with fixed-point attributes. Inserting these local annotations results in a hybrid description of the algorithm, that is, some of the operands are specified bit-true, while the rest remain floating-point. A comparative simulation of the floating-point and the hybrid code within the same simulation environment shows whether the local annotations are appropriate, or if some annotations have to be modified. The integer word length of the local annotations can be derived from operand range monitoring during simulation runs. Typically, the designer manually annotates function parameters and key variables, for example, accumulator variables, which account for approximately 5% of all operands.

Figure 3: Quantization methodology with FRIDGE. (Flow diagram: floating-point ANSI C code plus local annotations yield hybrid code, which feeds a "hybrid" simulation; the hybrid code, together with global annotations, is interpolated into fixed-point code, which undergoes bit-true simulation on the simulation engine.)

Once the hybrid program matches the design criteria, the remaining floating-point operands are automatically transferred to fixed-point operands by interpolation. Interpolation denotes the process of computing the fixed-point parameters of the nonannotated operands from the information that is inherent to the annotated operands and the operations performed on them. Additionally, the interpolator has to observe a set of global annotations, that is, default restrictions for the calculation of fixed-point parameters. This can be, for example, a default maximum word length that corresponds to the register length of the target processor.

The interpolation results in a fully annotated program, where each operand and operation is specified in a bit-true way. Cosimulating this algorithm with the original floating-point code gives an accuracy evaluation; for changes, now only the set of local and/or global annotations has to be modified, while the rest is determined and kept consistent by the interpolator.

Described above are the algorithmic level transformations, as illustrated in Figure 1, that change the behavior or accuracy of an algorithm. The resulting completely bit-true algorithm in SystemC is not directly suited for implementation; it needs to be mapped to a target, such as a processor's architecture or an ASIC. This is an implementation level transformation, where the bit-true behavior normally remains unchanged. Within the FRIDGE environment, different back ends map the internal bit-true specification to different formats/targets, according to the purpose or goal of the quantization process.

3. FIXED-POINT DATA TYPES AND LOCAL ANNOTATIONS

Since ANSI C offers no efficient support for fixed-point data types [12, 13], we initially developed the fixed-point language fixed-C [14], which is a superset of the ANSI C language. It comprises different generic fixed-point data types, cast operators, and interpolator directives. The fixed-C language was licensed to Synopsys, Inc., and Synopsys contributed it as a set of additional fixed-point data types to the Open SystemC Initiative (OSCI) [11].


Figure 4: Fixed-point attributes of a bit-true description. (A word of length wl is split into a sign bit s, an integer part of width iwl, and a fractional part of width fwl; wl: word length, iwl: integer word length, fwl: fractional word length, s: sign encoding/sign bit.)

Together with additional fixed-point language elements from the A|RT Library by Frontier Design, Inc. [10], fixed-C has been the base for the development of the SystemC fixed-point data types that are now used in the FRIDGE project as well.

The SystemC fixed-point data types are utilized for different purposes in the FRIDGE design flow:

• Since ANSI C is a subset of SystemC, the additional fixed-point constructs can be used as bit-true annotations to dedicated operands of the original floating-point ANSI C file, resulting in a hybrid specification. This partially fixed-point code can be used for simulation or as input to the interpolator.

• The bit-true output of the interpolator is represented in SystemC as well. This allows a maximum transparency of the results to the designer, since the changes to the code are reduced to a minimum and the effects of the designer's directives, such as local annotations in the hybrid code, become directly visible.

The additional fixed-point types and functions are part of a C++ class library that can be used in any design and simulation environment that is based on or can integrate C or C++ code (see, e.g., [1, 2, 3]).

For a bit-true and implementation-independent specification of a fixed-point operand, a three-tuple is necessary: the word length wl, the integer word length iwl, and the sign s, as illustrated in Figure 4.

For every fixed-point format, two of the three parameters wl, iwl, and fwl (fractional word length) are independent; the third parameter can always be calculated from the other two: wl = iwl + fwl.

With a given sign encoding s, we can also compute the minimum and maximum value that the fixed-point format <wl,iwl> can hold. For example, for a two's complement (tc) signed representation, the minimum and maximum compute to

max<wl,iwl,tc> = 2^(iwl-1) - 2^(-fwl),   min<wl,iwl,tc> = -2^(iwl-1).    (1)

For an unsigned representation (us), on the other hand, the minimum and maximum are

max<wl,iwl,us> = 2^(iwl) - 2^(-fwl),   min<wl,iwl,us> = 0.    (2)
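As a quick numerical check of (1) and (2), the following small stand-alone C++ program (a hypothetical helper, not part of SystemC or FRIDGE) evaluates the range of a given <wl,iwl> format:

#include <cmath>
#include <cstdio>

/* Hypothetical helper: range of a fixed-point format <wl,iwl>
   according to (1) and (2); fwl = wl - iwl. */
static void print_range(int wl, int iwl)
{
    int fwl = wl - iwl;
    double lsb = std::ldexp(1.0, -fwl);              /* 2^(-fwl) */
    double max_tc = std::ldexp(1.0, iwl - 1) - lsb;  /* 2^(iwl-1) - 2^(-fwl) */
    double min_tc = -std::ldexp(1.0, iwl - 1);       /* -2^(iwl-1) */
    double max_us = std::ldexp(1.0, iwl) - lsb;      /* 2^iwl - 2^(-fwl) */
    std::printf("<%d,%d>: tc [%g, %g], us [0, %g]\n",
                wl, iwl, min_tc, max_tc, max_us);
}

int main()
{
    print_range(6, 3);  /* prints <6,3>: tc [-4, 3.875], us [0, 7.875] */
    return 0;
}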

Note that an integral data type is merely a special case of a fixed-point data type with an iwl that always equals wl; hence an integral data type can be described by two parameters only, the word length wl and the sign encoding s.

In the following sections, we provide a short overview of the most frequently used fixed-point data types and functions in SystemC. A more detailed description can be found in the SystemC user's manual [11].

3.1. The data types sc_fixed and sc_ufixed

The two’s complement data type sc fixed and the unsigneddata type sc ufixed receive their format when they are de-clared, that is, the fixed-point attributes must be known atcompile time (static arguments),

sc_fixed<wl,iwl> d,*e,g[8];sc_ufixed<wl,iwl> c;

Thus they behave according to these fixed-point parameters throughout their lifetime. This concept is called declaration time instantiation (DTI). Similar concepts exist in other fixed-point languages as well [9, 10, 15]. Pointers and arrays, as frequently used in ANSI C, are supported as well.

For every assignment to a DTI variable, a data type check is performed. If the left-hand data type does not match the right-hand data type, as illustrated in the code example below, an implicit cast to the left-hand data type becomes necessary:

sc fixed<6,3> a,b;sc ufixed<12,12> c;a = b; /* correct, both types match */c = b;/* type mismatch -> implicit cast necessary */

The data types sc_fixed and sc_ufixed are the data types of choice, for example, for interfaces to other functionalities or for lookup tables, since they behave like a memory location of a specific length and a known embedding/scaling.

3.2. The data type sc_fxval

In addition to the DTI data type concept, SystemC provides the assignment time instantiation (ATI) data type sc_fxval. This type may hold fixed-point numbers of arbitrary format and is especially tailored for the float-to-fixed transformation process. A declaration of a variable of type sc_fxval does not specify any fixed-point attributes; if subsequently in the code a fixed-point value is assigned to an sc_fxval variable, the variable is (re-)instantiated with all fixed-point attributes of the assigned value.
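A minimal sketch of this ATI behavior, reusing the sc_fix cast idiom that appears in the code examples of this section; the sc_main harness and the particular formats are assumptions:

#define SC_INCLUDE_FX  /* enable the SystemC fixed-point types */
#include <systemc.h>

int sc_main(int, char*[])
{
    sc_fxval x;              /* declared without fixed-point attributes */
    x = 2.75;                /* x takes the attributes of the assigned value */
    x = sc_fix(x, 6, 3);     /* re-instantiated with the <6,3> attributes */
    x = sc_fix(x, 12, 12);   /* a later assignment overrides prior attributes */
    return 0;
}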

3.3. The data types sc_fix and sc_ufix

Along with the static attribute types sc_fixed and sc_ufixed, SystemC also provides the fixed-point types sc_fix and sc_ufix that may also take nonstatic fixed-point attributes such as variables. The function in the code example below has the word length wl and the integer word length iwl as formal parameters, that is, wl and iwl are not known at compile time.


sc_fxval cast_func(int wl, int iwl, sc_fxval in)
{
    return sc_fix(in, wl, iwl);
}

As shown in this example, the constructors for the types sc_fix and sc_ufix are often used to cast a value to a different fixed-point format.

3.4. Cast modes

For a cast operation to a fixed-point format <wl,iwl,sign>, it is also important to specify the overflow and precision reduction in case the target data type cannot hold the original value:

a = sc_fix(input,wl,iwl,q_mode,o_mode);

The variable a holds a two's complement fixed-point format <wl,iwl>, and the value of input is cast to this fixed-point data type according to the quantization mode q_mode (which specifies the behavior in case of a word length reduction at the LSB side) and the overflow mode o_mode (which specifies the behavior in case of a word length reduction at the MSB side). The most important casting modes are listed below. SystemC also specifies many additional cast modes to model target specific behavior.

Quantization modes

Truncation (SC_TRN). The bits below the specified LSB are cut off. This quantization mode is the default for SystemC fixed-point types and will be used if no other value is specified.

Rounding (SC_RND). Adds LSB/2 first, before cutting off the bits below the LSB.

Overflow modes

Wrap-around (SC_WRAP). In case of an overflow the MSB carry bit is ignored. This overflow mode is the default for SystemC fixed-point types and will be used if no other value is specified.

Saturation (SC_SAT). In case the minimum or maximum value is exceeded, the result is set to the minimum or maximum value, respectively.
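As an illustration, the following sketch applies these modes to the two's complement format <4,2> (resolution 0.25, range [-2.0, 1.75]); the expected results follow from the definitions above, and the sc_fix cast constructor is used in the form shown before:

sc_fxval x = 1.375;
sc_fxval t = sc_fix(x, 4, 2, SC_TRN, SC_WRAP);   /* truncation:  1.375 -> 1.25 */
sc_fxval r = sc_fix(x, 4, 2, SC_RND, SC_WRAP);   /* rounding:    1.375 -> 1.5  */
sc_fxval s = sc_fix(2.0, 4, 2, SC_TRN, SC_SAT);  /* saturation:  2.0 -> 1.75 (max) */
sc_fxval w = sc_fix(2.0, 4, 2, SC_TRN, SC_WRAP); /* wrap-around: 2.0 -> -2.0 (MSB carry ignored) */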

With the sc_fxval type, every assignment to a variable overwrites all prior instantiations, that is, one sc_fxval variable may have different context-specific bit-true attributes in the same scope. This concept of ATI is motivated by the specific design flow: transformation starts from a floating-point program, where the designer abstracts from the fixed-point problems and does not think of a variable as a finite length register.

The concept of local annotations and ATI is also an effective way to assign context specific information without changing structures or variables when exploring the fixed-point design space.


4. INTERPOLATION

The interpolator with its control and data flow analyzer is the core of the FRIDGE design environment. As depicted in Figure 3, it determines the fixed-point formats for all operands of an algorithm, taking as input a user annotated hybrid description of the algorithm and a set of global default rules, the global annotation file. Hence interpolation describes the computation of the fixed-point parameters of the nonannotated operands from the information that is inherent to the annotated operands.

The interpolative concept is based on three key ideas:

(1) Attribute propagation. The method of using the attributes of the bit-true specified operands in the code to calculate bit-true attributes for the remaining operands and operations in the code.

(2) Global annotations. The description of default rules and restrictions for attribute propagation.

(3) Designer support. The interpolator supplies feedback and reports to assist the designer in debugging or improving the interpolation result.

For a better understanding, the first two points are explained in more detail in the following.

(1) Attribute propagation. Given the information of the fixed-point attributes of some operands, the type and the fixed-point format of other operands can be extracted from this information. For example, if for the inputs to an operation both the range and the relevant fractional word length are specified, the same attributes can be determined for the result. (An exception is the division, where the accuracy of the operation must be specified as well.)

Consider the following line of code:

c = a + b; d = 1.5; e = c * d;

The corresponding data flow graph is depicted in Figure 5. We assume that the ranges and the precision of the variables a and b are known, for example, by user annotations:

$$a \in [-0.25, 0.75] \Longrightarrow R_a = [-0.25, 0.75];\ \mathrm{fwl}(a) = 2,$$
$$b \in [-1.25, 0.5] \Longrightarrow R_b = [-1.25, 0.5];\ \mathrm{fwl}(b) = 2. \tag{3}$$

To receive the range R_c for the variable c that contains the sum of the variables a and b, we add the ranges R_a and R_b (a detailed description of the range arithmetic used here can be found in [14]),

$$R_c = R_a + R_b = \left[\min_a + \min_b,\ \max_a + \max_b\right] = [-1.5, 1.25]. \tag{4}$$

The precision P_c (fwl) for the sum c computes to the maximum of the precisions P_a and P_b,

$$P_c = \max\left(P_a, P_b\right) = 2. \tag{5}$$

Figure 5: Example for interpolation of ranges/word lengths.

The information on the range and on the precision of the variable c is sufficient to calculate the required word length or integer word length for c. The correlation between fwl, range, and iwl yields the iwl of c:

$$\mathrm{iwl}_c = \left\lceil \max\left(\log_2\left|\min_c\right|,\ \log_2\left(\left|\max_c\right| + 2^{-\mathrm{fwl}_c}\right)\right) + 1 \right\rceil = \left\lceil \max(0.58, 0.58) + 1 \right\rceil = 2. \tag{6}$$

Thus the resulting format for c is <4,2,tc>, where tc indicates the two's complement representation of c.

The next step for the interpolator is to compute the fixed-point format of the constant d. Since the range of d is R_d = [1.5, 1.5] and the precision is P_d = fwl_d = 1, the iwl of d can be calculated as

$$\mathrm{iwl}_d = \left\lceil \log_2\left(\max_d + 2^{-\mathrm{fwl}}\right)\right\rceil = \left\lceil \log_2(1.5 + 0.5)\right\rceil = 1. \tag{7}$$

After all fixed-point parameters of the input operands to the multiplication e = d * c are known to the interpolator, it continues with the calculation of the bit-true format and parameters for the variable e:

$$R_e = R_c \cdot R_d = [-1.5, 1.25] \cdot 1.5 = [-2.25, 1.875],$$
$$P_e = P_c + P_d = 2 + 1 = 3 \Longrightarrow \mathrm{iwl}_e = \left\lceil \max\left(\log_2\left|\min_e\right|,\ \log_2\left(\left|\max_e\right| + 2^{-\mathrm{fwl}_e}\right)\right) + 1 \right\rceil = \left\lceil \max(1.17, 1) + 1 \right\rceil = 3. \tag{8}$$

Hence we receive a fixed-point format of <6,3,tc> for the variable e.
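Expressed as SystemC declarations, the interpolated formats of this example would read as follows (an illustration only; the ranges and precisions are the ones derived above):

sc_fixed<4,2> c;   /* <4,2,tc>: Rc = [-1.5, 1.25], fwl = 2 */
sc_fixed<6,3> e;   /* <6,3,tc>: Re = [-2.25, 1.875], fwl = 3 */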

Note that this is a rather conservative way of interpolation: bits that may contain any information are never discarded. For the MSB side this is called a worst case interpolation, since with the iwl calculated by the interpolator an overflow is impossible, while on the other hand it may lead to iwls much larger than actually needed. In this case the designer may add additional local annotations to cut back the iwl to a more suited value. For the LSB side this is called maximum precision interpolation (MPI), that is, by default every LSB of the operands is kept, maintaining the highest possible accuracy. LSBs are only discarded if the word length exceeds the maximum word length specified in the global annotation file. This can lead to a large increase in the fwl, but with additional local annotations the designer can also keep the fwl shorter. In [6] we also describe a method to have the interpolator calculate a less conservative value for the fwl.

(2) Global annotations. While local annotations express fixed-point information for single operands, the global annotations describe default restrictions to the complete design. For different targets, different global restrictions apply. For SW, the functional units to perform specific operations are already defined by the architecture of the processor. Consider a 16 × 16 bit multiplier writing to a 32-bit register. A global annotation can supply the information to the interpolator that the word length of a multiplication operand must not exceed 16 bits, while the result may have a word length of up to 32 bits.

4.1. Implementational issues

In a first step, the FRIDGE front end parses the hybrid description into a C++-based intermediate representation (IR). Then range propagation is performed to determine the bit-true format for all the operands. During this process, control and data flow analysis is also carried out. The information gained is stored in the IR. The advanced algorithms used for the analysis will be described in Section 5.

After this process the IR holds a bit-true description of the algorithm with additional control and data flow information. These data structures form the basis for additional transformation steps performed in the FRIDGE back ends that target different languages and platforms.

5. ADVANCED DATA FLOW ANALYSIS

During the development of the FRIDGE design environment, we have identified a need for accurate data flow analysis to meet the needs of the interpolation, the fast simulation code generation, and the target specific code optimization. The published methods were not capable of matching the requirements, thus we have developed a novel approach for data flow analysis that can provide the necessary data for the FRIDGE back ends.

Researchers have worked on program analysis techniques since the 1960s and there is, by now, an extensive literature [16]. There are two major approaches to program analysis:

(a) There are static analysis techniques that analyze the program code at compile time. Usually, sets of equations are set up according to the program semantics and solved by finding their fixpoint. One of the best known static approaches is Data Flow Analysis. It is treated in depth in standard compiler books [17, 18]. Other techniques such as constraint-based analysis and abstract interpretation are also described in [19]. PAG [20] is a tool for generating interprocedural data flow analyzers that implement these techniques.

(b) On the other hand, there are techniques for dynamic analysis that are used for examining the behavior of program code during execution. Typically, these techniques are employed by profiling tools. Profiling information can for example be used by programmers to find critical pieces of code or as input to profile-driven optimizers. Dynamic program analysis techniques have been implemented in tools like Pixie [21] or QPT [22]. By principle, dynamic program analysis relies on input vectors to be processed during execution. Thus the results are of no general nature.

Analysis techniques of neither category are suited for the needs of the FRIDGE design environment. Static analysis puts tight constraints onto the code to be analyzed. The use of pointers is usually not supported or yields too conservative results. Implementations of digital signal processing systems usually make extensive use of pointers, even, for example, for iterating over data arrays. Furthermore, static analysis is blind to program properties that result from run time effects. However, especially these properties have to be taken into account by FRIDGE in order to obtain precise results.

Dynamic analysis is to some extent capable of detecting these properties. Nevertheless, it is not applicable for the FRIDGE design environment for two reasons. First, the results are of a statistical, numerical nature. There is no way to gain information about data flow or control flow properties. Second, the results are not generally valid, that is, they only reflect the behavior of the program running on the given input vectors. FRIDGE requires analysis results that are valid for all possible executions of the program though.

The requirements for the analysis employed by FRIDGE are different from those of standard tools like, for example, a general purpose compiler. FRIDGE is focused on digital processing systems. These systems are typically data flow dominated, that is, their execution is to a great extent independent of the data to be processed. Besides, the accuracy and quality of the results are more important than speed (of analysis). This allows for a more comprehensive code analysis than, for example, a general purpose compiler can apply. In order to gain precise results, including also run time properties, and to be able to handle pointer operations, the code is interpreted. Since there is no concrete data to be processed, we process abstract data instead. In the following this methodology is referred to as abstract execution.

The data flow analysis unit in the FRIDGE design environment is based on three main components:

(1) The concept of data abstraction.
(2) The state controlled memory model.
(3) The concept of coupled iterators.

5.1. Data abstraction

While in concrete execution numeric values are written to and read from memory, we use operations for abstract execution. An operation is a collection of information about possible values. The two most important elements are

(1) the range, that is, the minimum value and the maxi-mum value, and

(2) a reference to the expression in the code that corresponds to the operation (this reference serves for gaining data flow information).


Furthermore, operations may be ambiguous. Consider the code example below.

01 int func(int x, int y, int z){
02     int a, b, c, d;
03
04     switch(y){
05     case 1:
06         a = 8; break;
07     case 2:
08         a = 16; break;
09     case 3:
10         a = 32;}
11
12     if(z>0)
13         b = 0;
14     else
15         b = 1;
16
17     if(x>0){
18         c = 5;
19         d = a;}
20     else {
21         c = b;
22         d = 7;}
23
24     return c + d;
25 }

The only information available about the parameters x, y, and z is that they are integers. Hence it cannot be decided which branches of the switch- and if-statements in lines 04, 12, and 17 are executed. This results in an ambiguous content, for example, of variable b, namely, values 0 and 1 (when talking about a value, we mean an operation with a range degenerated to a value), referring to the expressions in lines 13 and 15, respectively. We combine both operations to an ambiguous operation. In addition, ambiguous operations are associated with conditions, under which the alternatives are chosen. In the example, alternative 0 is chosen if (z > 0) is true, alternative 1 if it is false. In general, there may be more than two alternatives, and conditions may be combined by a logical AND.

Operations are arranged in graphs similar to the binary decision diagrams introduced by Akers [23], where the nodes embody the ambiguous operations and the leaves the unambiguous operations.

In general, operations are described by the following rules:

(i) an operation is either an unambiguous operation or an ambiguous operation;

(ii) an unambiguous operation represents a possible content in memory during concrete execution of a program;



(iii) an ambiguous operation is associated with a control flow ambiguity in the code (dashed line in Figure 7) and matches each possible branch to an operation.

Figure 6: Abstract execution (the interpreter reads and writes through the state controlled memory model, which is controlled by the current state).

Thus these trees do not only contain the alternatives, but also the conditions under which the alternatives are taken. The conditions are determined by all the ambiguities along the path from the root to the alternative. Each ambiguity contributes to the condition in such a way that the condition associated with the link to the next operation on the path must be fulfilled, that is, the corresponding control flow branch must be executed. A logical AND is applied to the contributions of each ambiguity.

For example, the tree in Figure 7 with A3 as its root shows the ambiguity tree corresponding to variable d in line 24. The path to value 32 (bold line) goes through the ambiguities A3 and A4. A3 is associated with the if-statement and the path follows the link that is associated with the true-branch. That yields the condition (x > 0) == true. Further on, the path passes through A4 and follows the link to 32. A4 is associated with the switch-statement and the link to 32 with case 3. That yields the condition y == 3. Thus the resulting condition for A3 taking on the value 32 is (in C syntax)

(x > 0) == true && y == 3
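To make the data abstraction concrete, the following minimal C++ sketch models an operation as described above; the names and the flat string representation of conditions are illustrative, not the actual FRIDGE data structures:

#include <string>
#include <utility>
#include <vector>

/* condition under which an alternative is chosen,
   e.g. "(x > 0) == true && y == 3" */
typedef std::string Condition;

struct Range { double min, max; };   /* element (1): the possible value range */

struct Operation {
    Range range;                     /* range of the possible values */
    int expr_ref;                    /* element (2): reference to the code expression */
    /* empty for an unambiguous leaf; otherwise one guarded alternative
       per control flow branch of the associated ambiguity */
    std::vector<std::pair<Condition, Operation*> > alternatives;
};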

5.2. The state controlled memory model

As illustrated in Figure 6, the state controlled memory model serves as a regular memory that can be read and written to. Besides, it is responsible for building the ambiguity trees described in Section 5.1.

As long as the current state is the initial state, the behavior of the state controlled memory model does not differ from a regular memory. Once the current state contains a condition, all changes done to memory contents only occur under that condition and result in appropriate ambiguity trees. The state is defined by a set of assumptions about the result of particular expressions in the code. A logical AND is performed on these assumptions. The initial state makes no assumptions at all. Other valid states could for example be "(x > 0) == true" or "(x > 0) == true && y == 3". During abstract execution, the state can be changed by the interpreter.


Figure 7: Iterating over ambiguities (ambiguity trees for the variables c and d, and the sequence of current states during the iteration steps (1) to (6)).

5.3. Iterating over ambiguities

When abstractly executing statements (Section 5.4) or computing the set of all possible evaluations of an expression (e.g., when computing the fixed-point parameters of an expression), we have to iterate over the alternatives of ambiguities. This is basically done by traversing the corresponding tree. However, the current state is taken into account, that is, only those alternatives are visible whose conditions are not contradictory to the current state. Furthermore, when selecting an alternative from an ambiguity, the corresponding conditions are, if not yet included, added to the current state. This way, the following is achieved: All data couplings are taken into account, that is, no impossible cases are considered. Alternative executions of statements can be done without further thought about the current state (see Section 5.4).

Selecting an alternative from an ambiguity is done by building a path through the corresponding tree. The end of the path is an unambiguous operation. In principle, iterating is performed on all successors of an ambiguity first, before iterating over the alternatives of the ambiguity itself (depth first). When establishing a path through an ambiguity, two basic cases have to be considered:

(1) The current state contains a condition respective to the control flow fork that is associated with the ambiguity. In this case, the path must follow the link that corresponds to the condition and may not be altered. The node is considered a slave node.

(2) The current state does not yet contain a condition respective to the control flow branch that is associated with the ambiguity. In this case, a possible branch is selected and the path is extended by the corresponding link. The corresponding condition is added to the current state. The node is considered a master node. During further iteration, the path will switch to all other links successively. When this is done, the respective condition has to be updated accordingly. After that, the condition is removed from the current state.

The trees in Figure 7 show the contents of variables c (left-hand side) and d (right-hand side) connected to line 24 in the code. Figure 7 also illustrates how to iterate over all possible combinations of contents of both variables. Note how building a path through an ambiguity affects the current state and how the current state masks the visible alternatives of ambiguities. First of all, value 5 is selected from ambiguity A1. The corresponding condition ((x > 0) == true) is added to the current state. Thus A1 becomes a master node. When building the path through A3, A3 becomes a slave node, because the current state already makes an assumption about the control flow ambiguity that is associated with A3 ((x > 0)). Therefore, the path must follow the link from A3 to A4. Nodes A2 and A4 are associated with different control flow forks, respectively. They always become master nodes and never affect any other ambiguities. Steps 2 and 3 iterate over the remaining visible alternatives of the right-hand tree. Step 4 switches to the second alternative of master node A1 (false). This affects the slave A3: as long as the path in the left-hand tree goes from A1 to A2 (steps 4 and 5), the only visible alternative of the right-hand tree is 7. In step 6 the iteration has been completed.

5.4. Execution of a program

Figure 8 shows how statements are abstractly executed. The solid lines represent the control flow of a concrete execution. Abstract execution also follows that control flow. However, statements that depend on ambiguous data are executed multiple times (dashed lines), once for every possible vector of the involved ambiguities. The vectors are iterated over as described in Section 5.3. Thus every execution is performed in a different current state, such that changes in memory together with their corresponding states are stored in ambiguity trees. This algorithm is applied recursively for nested statements. Any code constructs can be executed this way.

Although a possibly large number of execution states exists, we found that the run time and the memory consumption of the analysis were remarkably low for typical signal processing algorithms. In most cases the control and data flow analysis was performed in less than one second on an 800 MHz PC.

Figure 8: Abstract executions of sequential statements.

The information gained during abstract execution is stored in the intermediate representation of the algorithm. The FRIDGE back ends, which will be introduced in the next sections, access this information to perform several code transformation steps.

6. FAST BIT-TRUE SIMULATION

As pointed out in Section 1, transforming a signal processing algorithm from a floating-point to a fixed-point representation requires extensive simulations due to the nonlinear nature of the quantization process. The available C++-based fixed-point libraries [10, 11] offer a high modeling efficiency, but the simulation speed of these libraries on the other hand is rather poor. This makes simulation speed a major bottleneck in the fixed-point design process.

Utilizing C-based fixed-point libraries like the ETSI basic arithmetic operations [24] does not overcome this problem, as the simulation speed still has a considerable overhead compared to an equivalent floating-point implementation.

Existing C++-based simulation libraries model the fixed-point operands as objects. In order to offer generic fixed-point data types without word length restrictions, data container types are used as an internal representation. Bit-true operations are performed by operator overloading. Range checking, the choice of cast modes, and many other decisions necessary for correct bit-true behavior are done at simulation time. The price for this flexibility and ease of modeling is slow execution speed, as the generic fixed-point data types modeled by extensive C++ constructs cannot be efficiently mapped to the architecture of the host machine by today's C++ compilers.

A simulation speedup can be achieved by mapping the fixed-point operands to the mantissa of the floating-point hardware of the host machine and by bit level manipulations to maintain bit-true behavior. This restricts the maximum word length of the fixed-point operands to the word length of the mantissa. This approach has been described by Kim et al. [25] and it is also implemented in the SystemC library [11].
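A minimal sketch of this idea follows: if the operands are kept as integer multiples of 2^-fwl inside a double, a product is exact as long as it fits into the 53-bit mantissa, and truncation back to the original format reduces to a floor operation (illustrative code, not the actual library implementation):

#include <math.h>

/* bit-true multiply with truncation, operands held in the double mantissa */
double fx_mul_trunc(double a, double b, int fwl)
{
    double s = ldexp(1.0, fwl);   /* 2^fwl */
    return floor(a * b * s) / s;  /* drop the extra fwl fractional bits */
}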


Another means of speeding up fixed-point simulations is the use of a hardware accelerator, for example, an FPGA, to perform computationally expensive operations. The acceleration can be achieved either by utilizing configurable logic or by combining configurable logic with a processor. This approach has been described by De Coster [26]. The mapping of the algorithm to the different hardware units and the data transfer between the units make additional transformation steps necessary.

The work described in this article proposes a mapping of the fixed-point algorithm in SystemC to an integer-based ANSI C algorithm that directly addresses the built-in integer ALU of the host machine. An efficient mapping includes an embedding of all fixed-point operands into the host machine registers, a cast mode optimization, and many other aspects, and requires a detailed control and data flow analysis of the algorithm. Independently from the authors' work, De Coster [26] proposed a similar method, using DFL [27] as input language and directly targeting a Motorola DSP56000.

Our work presented here represents a continuation of the research results published by Keding et al. [6] and Willems [14] and introduces improved concepts for the mapping process that result in a considerable simulation acceleration.

For the fast simulation back end we assume that fixed-point attributes are assigned to every operation. The back end also requires the information collected during the control and data flow analysis stored in the IR. After a number of IR refinements, an ANSI C representation of the algorithm using only integral data types can be derived from the IR. It is important to note that the transformation in the back end, in contrast to the float-to-fixed transformation in the IR, does not change the behavior of the algorithm. The fully quantized algorithm coded in SystemC and the integer-only ANSI C algorithm yield bit-by-bit identical results, making the fast simulation back end output ideally suited for fast bit-true simulation on a workstation or PC.

7. TRANSFORMATION TO ANSI C

7.1. The lbp alignment

For the embedding of a fixed-point operand specified by a triple (wl, iwl, sign) into a register of the host machine with the machine word length (mwl), the minimum requirement is

$$mwl \geq wl = iwl + fwl. \tag{9}$$

Figure 9 illustrates different options for embedding an operand with a word length of 5 bit into a given mwl of 8. Obviously, for mwl > wl, a degree of freedom for choosing the location of binary point (lbp) exists:

$$mwl - iwl \geq lbp \geq wl - iwl = fwl. \tag{10}$$

Figure 9: Embedding a 5-bit word into an 8-bit register (mwl: machine word length; wl: word length; iwl: integer word length; fwl: fractional word length; lbp: location of binary point; s: sign encoding).

Beside this degree of freedom, there are also a number of constraints for the selection of the lbp:

(i) Interface constraints. For interface elements, such as function parameters or global variables, the lbp must be defined identically for a function and all calls to this function. Otherwise, the data written to or read from these data elements will be misinterpreted.

(ii) Operation constraints. Each operation has an lbp syntax. This lbp syntax may include constraints on the lbp of the operand(s) of the operation and/or rules for the calculation of the lbp of the result. For example, the operands and the result of an addition must have the same lbp.

(iii) Control and data flow constraints. Generally, a read access to a storage element must use the same lbp as the preceding write access to the storage element. This implies that if a write operation to a memory location occurs in alternative control-flow branches, the lbp must be at the same position in both write operations, as no run time information about the lbp is available in a following read operation. The same applies to ambiguous write operations to arrays and write operations via pointers.
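As a small illustration of the embedding and of the degree of freedom in (10), consider a hypothetical helper that stores a real value as a machine integer scaled by 2^lbp (a sketch, not FRIDGE-generated code):

#include <math.h>

/* embed a value of format <5,2> into an 8-bit register at position lbp */
signed char embed(double v, int lbp)
{
    return (signed char) (v * ldexp(1.0, lbp));   /* stored integer = v * 2^lbp */
}

/* for <5,2> and mwl = 8, (10) yields 6 = mwl - iwl >= lbp >= fwl = 3, so
   embed(1.25, 3) == 10 and embed(1.25, 6) == 80 are both bit-true embeddings */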

7.1.1 The lbp alignment algorithm

The lbp alignment algorithm implemented in the fast simulation back end is designed to take advantage of the degree of freedom described by (10), while meeting the constraints specified above. Meeting these constraints and maintaining the consistency of the lbps require precise information about the control and data flow of the algorithm. To obtain this information we used the data flow analysis method described in Section 5. The data flow information is represented basically as define-use (du) chains and use-define (ud) chains [17, 18], with additional and more accurate information about ambiguous control flow.

Initially, for all operands lbp = fwl is chosen. Thus all operands are right aligned. In a first step we set the lbps of all interface elements according to the interface constraints.

Then, in an iterative process, the data flow information is used to adjust the lbps by insertion of shift operations to meet the operation constraints and the control and data flow constraints. The algorithm terminates when all conditions are fulfilled and the lbps did not change during the last iteration.

The operation constraint lbp alignment algorithm basically consists of an iteration over all operations and an adjustment of the operand and result lbps according to the operation's lbp syntax.

The control and data flow constraint lbp alignment algorithm searches for all read accesses from a data element the associated previous write accesses to the same data element, that is, it finds all defines for a use of a data element (ud-chains). According to the control and data flow constraints, the lbps of operands linked by such ud-chains are set to the same value.

Finally, the embedding of constants can be done in such a way that the shift operations required when using the constant are minimized.

Unlike Kum et al. [28], we do not use a shift operation minimizing approach here; using the degree of freedom in choosing a suited lbp (10) and the accurate data flow information, we found that there is not sufficient potential for this optimization to justify the effort.

7.2. Data type selection

The next step in the transformation process is the selection of suitable integral data types for fixed-point variables. The FRIDGE internal bit-true specification of the algorithm features arbitrary word lengths. With the SystemC back end this does not represent a problem, since the SystemC data types are generic and may be of any bit length required. With the fast simulation back end, on the other hand, we only have the limited pool of the built-in data types of the host machine, that is, integral data types like char, short, int, long.

7.2.1 Basic constraints for any data element

A matching data type for every fixed-point variable has to be chosen. The minimum requirement for the data type chosen is that it can be embedded into the host machine data type with word length mwl at the correct location (see Figure 9 for illustration), iwl + lbp ≤ mwl.
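A small sketch of such a selection rule is given below; it simply picks the smallest host integral type whose width covers the MSB position iwl + lbp (illustrative only, not the algorithm of [29]):

/* choose the smallest host integral type that can hold iwl + lbp bits */
const char *select_type(int iwl, int lbp)
{
    int bits = iwl + lbp;
    if (bits <= 8)  return "char";
    if (bits <= 16) return "short";
    if (bits <= 32) return "int";
    return "long";   /* assuming a 64-bit long on the host */
}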

7.2.2 Structural constraints

Additionally, the requirements introduced by data structures that force each of their elements to be of the same data type have to be met. An example for this behavior are arrays. The target data type for the N elements of an array must fulfill the following condition:

$$\max_{i=0}^{N-1}\left(\mathrm{iwl}_{\mathrm{array}[i]} + \mathrm{lbp}_{\mathrm{array}[i]}\right) \leq mwl.$$

7.2.3 Semantical constraints

Another constraint becomes important if aliasing of data elements occurs, for example, by pointers: a pointer may point to different data elements. For syntax and semantics reasons, the types of all aliased data elements and the base type of the pointer must be identical [13]. This only causes a problem if data types are changed, as is done in fixed-point optimizations or in the floating-point to fixed-point transformation process described in Section 2: initially, most numerical data types are floating-point types, but after the transformation there are various different fixed-point data formats. Hence special care must be taken during the code generation process to ensure that the types are consistent. A detailed description of the data type selection algorithm used can be found in [29].

7.3. Cast mode transformation

Cast operations can reduce or limit the word length at the MSB side of a word (overflow handling) or at the LSB side of a word (quantization handling). They are used either to prevent nondeterministic behavior of fixed-point systems (in many cases, the ANSI C standard [13] does not specify the bit-true behavior of integral data types in case of overflow, quantization, and so forth) or to model a data path that is different from the host machine. This is often the case when algorithms for DSP systems are developed. Fixed-point libraries like the one in SystemC offer various generic overflow and quantization handling modes, which makes SystemC an efficient means of modeling fixed-point systems. For fast fixed-point simulation, on the other hand, the use of these generic casting modes is simply ruled out for performance reasons.

7.3.1 Overflow handling

Overflow handling is required if it is necessary to reduce the wl at the MSB side of the word or if the carry bit is set for the MSB. Examples for frequently used overflow handling modes in digital signal processing algorithms are wrap-around and saturation [30].

Saturation

In SystemC, a cast of an expression expr to a wl-bit tc data type with integer word length iwl applying saturation as overflow mode can be modeled as follows:

result = sc_fix(expr,wl,iwl,...,SC_SAT);

The fast simulation code generation on the other hand translates this into plain C code that first tests if the range of the data type is exceeded, and if so, sets the resulting value to the minimum or maximum of this type, which is

$$\mathrm{MAX}_{wl,iwl,lbp,tc} = 2^{iwl+lbp-1} - 2^{lbp-fwl},$$
$$\mathrm{MIN}_{wl,iwl,lbp,tc} = -2^{iwl+lbp-1} + 2^{lbp-fwl} - 1. \tag{11}$$

Thus the fast simulation code construct generated is the following (note that for the code generation we also take the bit-true properties of the processor and compiler into account):

int tmp;
result = ((tmp = expr) > MAX) ? MAX : (tmp < MIN) ? MIN : tmp;

Introducing an additional temporary variable avoids multiple evaluations of expr.
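For readability, the same construct can also be packaged as an inline helper (a sketch; max_v and min_v are the constants of (11) for the target format):

/* saturating cast: clamp expr to [min_v, max_v], evaluating expr exactly once */
static inline int sat_cast(int expr, int max_v, int min_v)
{
    int tmp = expr;
    return (tmp > max_v) ? max_v : (tmp < min_v) ? min_v : tmp;
}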

Wrap-around

The SystemC way of casting an expression expr to a wl-bit tc data type with integer word length iwl applying wrap-around as overflow mode is shown here,

result = sc_fix(expr,wl,iwl,...,SC_WRAP);

For the bit-true ANSI C equivalent of this operation several options exist. An example for a code construct for wrap-around, assuming two's complement arithmetic and a machine word length of mwl, is



result = (expr << SHIFT) >> SHIFT;

The amount of shifts computes to SHIFT = mwl − iwl − lbp. The shift left eliminates the MSBs, whereas the arithmetic shift right provides a sign extension for the new MSB.

7.3.2 Quantization handling

If the word length of an operand is reduced at the LSB side, we can apply different quantization handling modes. The most frequently encountered are rounding and truncation.

Rounding

In SystemC, the method for casting an expression expr to a wl-bit two's complement data type with integer word length iwl applying rounding as quantization mode is

result = sc_fix(expr,wl,iwl,SC_RND,...);

Rounding is defined by adding DELTA = LSB/2 to the operand and eliminating the LSBs, for example, by shifting it right by SHIFT = lbp − fwl bits. Thus the rounding operation can be realized in the fast simulation code by

result = ((expr + DELTA)>>SHIFT)<<SHIFT;

Truncation

The truncation operation, given in SystemC by

result = sc_fix(expr,wl,iwl,SC_TRN,...);

can be implemented efficiently by a bit mask operation,

result = expr & (~MASK);

where MASK is given by $2^{lbp-fwl} - 1$. For several combinations of cast modes, for example, wrap-around combined with rounding or truncation, more efficient joint quantization and overflow handling C code constructs are generated. The shift operations introduced by the cast code constructs are also utilized to adjust the lbp of the expression, eliminating the need for additional scaling shifts.

8. EXPERIMENTAL RESULTS

The code generated by the FRIDGE fast simulation back end has been benchmarked against the fixed-point simulation classes, which are part of the C++-based SystemC language. The simulation classes offer two simulation modes: a mode supporting unlimited fixed-point word lengths based on concatenated data containers and a mode supporting limited precision of up to 53 bits based on float arithmetic and bit manipulations.

The benchmarks have been performed on a SUN Ultra 10 workstation running SOLARIS using the GCC compiler version 2.95.2 with the -O3 option. The SystemC library version 1.0 was utilized for the bit-true simulations. The benchmark is based on typical signal processing kernels: FIR (17-tap FIR filter), DCT (8 × 8 JPEG DCT algorithm), Autocorr (25 elements, 5th order autocorrelation), IIR (3rd order IIR filter), FFT (complex FFT of length 8), and Matrix (4 × 4 matrix multiplication).

Four different versions of the kernel functions have been benchmarked:

(i) Floating-point. The execution speed of the floating-point implementation of the algorithms serves as reference for the benchmarks.

(ii) SystemC. The quantized bit-true version of the algorithms utilizing the SystemC fixed-point data types. The algorithms have been quantized using the FRIDGE design environment.

(iii) SystemC limited precision. The quantized bit-true code has been compiled with the limited precision option to speed up SystemC fixed-point operations.

(iv) Fast simulation code. The fast fixed-point simulation code based on integral data types has been generated by the FRIDGE back end applying the transformation techniques described in the previous sections. The code yields bit-by-bit the same results as the code utilizing the SystemC data types.

The experimental results are presented in Table 1. As the floating-point code has been used as a reference, the experimental data has been scaled relative to the execution speed of the floating-point code. The bit-true SystemC code consumes by a factor of 325 to 1103 more run time than the original floating-point code, making bit-true simulation a major bottleneck in the fixed-point design flow. Utilizing the limited precision mode of the SystemC library, a speedup by a factor of 3.1 to 5.2 can be achieved, but the fixed-point code is still slower than the floating-point reference by a factor of 67 to 234.

The fast simulation code runs faster by a factor of 18.8 to 90.9 compared to the SystemC fixed-point code utilizing the limited precision option. For the unlimited precision mode the speedup is 91.0 to 454.2, respectively.

Compared to the floating-point reference code, the fast simulation code is slower by a factor of 2.5 to 6.9. This is due to the host system's architecture and the additional shift and bit mask operations necessary to perform lbp alignment and cast operations to maintain bit-by-bit consistency with the quantized code.

The quantized DCT algorithm contains many cast operations to reduce fixed-point word lengths introduced by the quantization process. As these operations can be modeled efficiently by bit mask operations in the fast simulation code, the highest speedup was achieved for this kernel function.

Table 1: Relative execution speed.

           Floating-point ANSI C   SystemC   SystemC limited precision   Fast simulation code
FIR        1.0                     386.5     102.7                       2.8
DCT        1.0                     1103.1    233.9                       2.5
Autocorr   1.0                     694.6     130.6                       6.9
IIR        1.0                     371.0     120.2                       3.1
FFT        1.0                     354.7     67.7                        2.6
Matrix     1.0                     325.9     71.2                        3.6

9. DSP CODE GENERATION

During recent years, new architectural approaches for DSP processors have been made. The current generation of high performance DSP processors features a pipelined VLIW (very long instruction word) architecture, which offers a very high computing performance if a high degree of software pipelining in combination with instruction level parallelism is used. But programming these processors manually utilizing assembly language is a very tedious task. In awareness of this problem, the modern DSP architectures have been developed using a processor/compiler codesign methodology, which led to compiler-efficient processor designs.

On the other hand, a significant gap in the system design flow is still evident; there is no direct path from a floating-point system level simulation to an optimized fixed-point implementation. Today a manual implementation on the DSP and target specific code optimization are necessary, increasing time-to-market and making design changes very tedious, error prone, and costly. Thus we have developed an optimizing FRIDGE back end to generate target optimized DSP C code. The target specific code generation is necessary for two reasons:

(i) The generic fixed-point data types used for fixed-point simulations are not suited for DSP implementation, as the currently available DSP compilers do not support C++ fixed-point data types. The upcoming generation of DSP compilers will support C++ language constructs, but compiling the fixed-point libraries for the DSP is no viable alternative, as the implementation of the generic data types makes extensive use of operator overloading, templates, and dynamic memory management. This will render fixed-point operations rather inefficient compared to integer arithmetic performed on a DSP.

(ii) Compiling the FRIDGE-generated integer ANSI C code on a DSP is also not sufficiently efficient, as the generic C code does not exploit the capabilities of the DSP hardware such as built-in saturation and rounding logic or SIMD processing.

As a case study, we have chosen the TMS320C62x processor and its C compiler as a target for the FRIDGE design environment. This enables a seamless design flow from floating-point to optimized C62x C code utilizing integral data types. Generating a C62x optimized version of a signal processing algorithm using a different set of fixed-point parameters becomes a matter of hours instead of days or weeks using the conventional manual techniques. The C62x integer code generated by the design environment yields bit-by-bit the same results as the fixed-point code utilizing C++ simulation classes on the host machine. Thus a comparative simulation against the "golden reference model" gives the designer a high degree of confidence in the generated code.

The first objective of our case study was to find out which C code constructs compile into efficient C62x assembly code. Thus we applied the DSPstone benchmarking methodology to the C62x optimizing C compiler. The DSPstone project [31], conducted in 1994 by ISS, Aachen University of Technology, established a benchmarking methodology for DSP compilers by comparing the performance of compiled C code to hand optimized assembly code in terms of program/data memory consumption and execution time. As a consequence, it allows to identify a possible mismatch between architecture and compiler. The benchmarking has been done using eleven typical signal processing algorithms (FIR, FFT, DCT, minimum error search, etc.). The benchmarking gives quantitative results for cycle count and program memory consumption.

In a second step, we used C62x specific C language extensions (intrinsics) and compiler directives to restructure the off-the-shelf C code while maintaining functional equivalence to the original code. These optimizations led to a considerable improvement in performance in many cases, as the compiler was able to utilize software pipelining and instruction level parallelism to speed up the code. It has turned out that software pipelining is the key to achieving a high performance but, on the other hand, requires careful analysis and code restructuring. The evaluation [32] gave quantitative performance data for the C62x compiler and a set of code optimization techniques to generate efficient C62x C code.

In a third step, we benchmarked various implementations of the fixed-point quantization and overflow handling modes on the C62x. This led to a set of optimized implementations for the quantization and overflow handling functionality.

9.1. DSP code transformation

The FRIDGE C62x back end performs similar transformation steps as the fast bit-true simulation code generation presented in Section 6: lbp alignment, cast mode transformation, and data type selection. Additionally, target specific code optimization is performed.

The designer has to keep the special requirements of the DSP target in mind to reach a high level of efficiency. Through our experiments we found that, for example, the number of cast statements and shift operations has a strong influence on the efficiency of the generated code. Thus if the designer chooses settings for the global annotations and the default cast mode during the early stages of the transformation which do not represent the properties of the target architecture properly, the code optimization and the DSP compiler are not able to generate efficient assembly code.

The optimizations performed in the FRIDGE C62x back end are source level transformations to supply the C62x compiler with the best C code possible. The amount of analysis done in an optimizing compiler is usually limited due to constraints on the time used for compilation. In the FRIDGE design environment, control and data flow analysis is performed with the maximum possible accuracy utilizing the techniques presented in Section 5. The information gained during this analysis is available for the back end code transformation as well. Thus we are able to perform code restructuring techniques which are usually beyond the scope of an optimizing compiler.

9.1.1 The lbp alignment

As the TI C6000 processor family has an integer multiplication mode, the right alignment strategy of the lbp alignment algorithm can also be applied in the C62x back end. This algorithm implicitly minimizes the number of scaling shifts. In contrast to the fast bit-true simulation, the number of scaling shifts generated is important for the C62x code generation. For the fast simulation code generation we found the potential of shift minimization limited to a performance improvement of 3% to 13% [29]. This is different for the C62x code generation. As the C62x can perform two scaling shift operations per cycle, a shortage of functional units limits the performance in highly software pipelined loops. Thus "shift poisoning" of loops must be avoided, for example, by choosing suitable fixed-point data types for function parameters and central data structures.

9.1.2 Data type selection

As the properties of the data paths of the C62x processor and the width of the integral data types supported by the C62x C compiler are known, the design environment can utilize this information during the transformation process. A set of global annotations for the C62x guides the interpolation process, and a set of integral data types with a given bit length is supplied to the C62x back end.

9.1.3 Cast mode transformation

The generic overflow and quantization handling modes offered by SystemC have to be mapped to the target hardware in an efficient manner. The C62x offers built-in saturation hardware which can be used by the back end. This is illustrated by the following example.

Cast mode: saturation

A cast of an expression to a wl-bit two's complement data type with integer word length iwl applying saturation as overflow mode is modeled in SystemC as follows:

result=sc_fix(expr,wl,iwl,...,SC_SAT);

An implementation of this code construct in generic ANSI C is

int tmp;
result = ((tmp = expr) > MAX) ? MAX : (tmp < MIN) ? MIN : tmp;

On the C62x, the _sshl intrinsic (saturating shift left) can be used to perform the saturation operation:

result=(signed)_sshl(expr,SHIFT)>>SHIFT;

where SHIFT is given by mwl − (iwl + lbp). Utilizing the built-in saturation hardware of the C62x via the _sshl intrinsic allows the generation of code with linear control flow, in contrast to the forked control flow of the ANSI C implementation. This significantly speeds up the code.

9.1.4 Loop optimizations

The key to high execution speed on the C62x is software pipelining and instruction level parallelism. This is especially important for loops, where most of the execution time is spent for most digital signal processing algorithms. The latest version of the C62x C compiler is able to perform quite sophisticated loop optimizations to achieve high performance. This can be further improved by restructuring the loops at source level, applying techniques like loop unrolling, scalar expansion, and splitting data paths. By introducing SIMD (single instruction multiple data) intrinsics, it is possible to reduce the required number of load/store operations significantly. The C62x back end utilizes the data and control flow information and the code transformation infrastructure to identify possible loop optimizations and to perform the necessary loop restructuring. The design environment maintains the consistency of the generated code.
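As a hypothetical illustration of such a SIMD optimization (hand-written illustration, not FRIDGE output), a 16-bit vector sum can be restructured around the C62x _add2 intrinsic, which performs two packed 16-bit additions in one instruction; the sketch assumes word-aligned arrays and an even element count:

/* vector sum with two 16-bit additions per instruction via _add2 */
void vec_sum2(const short *a, const short *b, short *c, int n)
{
    const int *pa = (const int *) a;   /* two 16-bit samples per 32-bit load */
    const int *pb = (const int *) b;
    int *pc = (int *) c;
    int i;
    for (i = 0; i < n / 2; i++)
        pc[i] = _add2(pa[i], pb[i]);   /* halves the number of loads and stores */
}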

10. EXPERIMENTAL RESULTS

We have benchmarked the cycle count performance of the generated C62x integer C code using two sets of typical signal processing kernel functions: the first set consists of six off-the-shelf kernels which have been initially coded without DSP specific code optimization; the second set of kernels has been extracted from TI's C6000 compiler benchmarking suite.

10.1. Off-the-shelf kernels

This set of kernels consists of six signal processing functions, which also have been used for the benchmarks in Section 8: FIR, DCT, Autocorr, IIR, Matrix, and Dotprod. The code has been translated using TI's C6x compiler version 4.0 [33] and the performance has been compared with three reference codes:

(i) C67x floating-point C code. The C67x floating-point DSP is code-compatible to the C62x and its C compiler is mostly identical to the C62x C compiler, thus the performance of the generated fixed-point C code can be compared to the original floating-point C code.

(ii) C62x floating-point emulation. The floating-point emulation library, which is part of the C62x compiler's run time library, allows the user to perform floating-point arithmetic on the C62x processor. The floating-point operations are executed as function calls.

(iii) C62x integer ANSI C code. The FRIDGE back end allows the designer to generate ANSI C fixed-point code without C62x specific optimization. This code can also be compiled and executed on the C62x processor. The efficiency of the target specific code optimization can be benchmarked using this code.


Table 2: Cycle count.

           Floating-point   Float emulation   Generic ANSI C   Target specific C
Device     C67x             C62x              C62x             C62x
FIR        132              1304              523              234
DCT        331              34163             1509             622
Autocorr   564              6581              3057             1041
IIR        73               708               82               81
Matrix     108              4999              1600             233
Dotprod    95               9436              1300             406

Figure 10: Cycle count relative to floating-point code.

Table 2 presents the benchmarking results for the six kernel functions. Figure 10 illustrates the relative cycle count. As the C67x floating-point code has been used as a reference, it was scaled to 100%. For readability, the results of the floating-point emulation have been omitted in the bar graph.

As depicted in Table 2, the C62x floating-point software emulation has a cycle count which is by a factor of 9.7 to 103 higher than the cycle count of the same code compiled for the floating-point processor.

The generic ANSI C integer code without C62x specific language extensions is by a factor of 1.1 to 14.8 slower than the floating-point code. The integer code performs additional shift and bit-masking operations to ensure the bit-true behavior. Some of the cast operations cannot easily be modeled in generic ANSI C. Thus a significant overhead is introduced for kernel functions where many cast operations are inserted by the interpolation (e.g., the DCT).

The performance can be improved by matching the generated code to the target architecture. For example, utilizing the _sshl intrinsic is a convenient way to access the C62x saturation hardware directly. This reduces the overhead introduced by the additional shift and cast operations to a factor of 1.1 to 4.3 compared to the floating-point code.

For the floating-point code of the Dotprod kernel function, the compiler was able to generate efficient code using 95 cycles for 64 vector elements. For the fixed-point code, the additional operations needed for cast operations in the inner loop prevent the compiler from achieving similar efficiency. Removing all scaling shifts and overflow protection from the inner loop of the fixed-point code for this kernel yields a cycle count of 83. Introducing a single scaling shift in the inner loop brings the cycle count up to 147; adding overflow protection yields 406 cycles. Similar effects appear in the Matrix kernel benchmark.

10.2. TI compiler benchmarking kernels

This set of kernels consists of six signal processing functions: IIR (16-coefficient IIR filter), IIR cas biquads (10 cascaded biquads), FIR (10-tap, 40 sample FIR filter), MAC VSELP (two 40 sample vectors), VQ MSE (MSE between two 256 element vectors), and VEC SUM (vector sum of two 44 sample vectors).

For these kernels, hand-optimized C62x assembly code and C62x integer C code are available on TI's website. It is noteworthy that neither the C code nor the assembly code was coded with overflow protection. For the embedding of input and output operands, implicit assumptions were made which reduced the number of scaling shifts in the kernel functions. Thus the hand-optimized C62x assembly code can serve as an "upper bound" for the efficiency of the FRIDGE C62x design flow.

We derived the floating-point code from the integer C code. The function interfaces in the floating-point code were manually annotated with fixed-point specifications to get hybrid code. The hybrid code was used as input to generate optimized C62x integer code from the FRIDGE C62x environment. The FRIDGE generated C62x code features full overflow protection and maintains consistency of the "location of binary point" for input and output operands. The code has been translated using TI's C6x compiler version 4.0 [33] and the performance has been compared to the reference codes:

(i) C67x floating-point C code. This is the floating-pointcode compiled for the C67x processor.

(ii) C62x hand-optimized integer C code. This is the original hand-optimized code from the benchmarking suite.

(iii) C62x hand-optimized assembly code. The hand-optimized assembly code served as a reference for the benchmarks.

Page 53: Implementation of DSP and Communication Systemsdownloads.hindawi.com/journals/specialissues/434718.pdf · ing society (1996–1998), a board member at IEEE neural network council,


Table 3: Cycle count.

Kernel     | Floating-point (C67x) | Hand-optimized assembly (C62x) | Hand-optimized ANSI C (C62x) | FRIDGE (C62x)
IIR        |  85 |  42 |  38 |  72
IIR BIQUAD | 149 |  70 |  82 | 108
FIR        | 315 | 237 | 278 | 373
MAC VSELP  | 175 |  61 |  59 | 207
VQ MSE     | 559 | 279 | 275 | 275
VEC SUM    |  63 |  48 |  51 | 127

[Bar chart omitted: per-kernel cycle counts of the hand-optimized assembly, hand-optimized C, and FRIDGE code as percentages of the floating-point C67x reference (100%).]
Figure 11: Cycle count relative to floating-point code.

(iii) C62x hand-optimized assembly code. The hand-optimized assembly code served as a reference for the benchmarks.
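To indicate what a hybrid function interface looks like, here is a minimal sketch; the annotation style shown in the comments is hypothetical and merely stands in for FRIDGE's actual fixed-point specification notation:

    /* Hybrid code sketch: the body remains floating-point, while the
       interface operands are pinned to fixed-point formats
       (wl = word length, iwl = integer word length).  The comment
       notation is illustrative, not the real FRIDGE syntax. */
    float fir10(const float x[40],   /* fixed: wl = 16, iwl = 1 */
                const float h[10])   /* fixed: wl = 16, iwl = 1 */
                                     /* result: wl = 16, iwl = 4 */
    {
        float acc = 0.0f;
        int i;
        for (i = 0; i < 10; i++)
            acc += x[i] * h[i];
        return acc;
    }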

Table 3 presents the benchmarking results for the six kernel functions. Figure 11 illustrates the relative cycle count. For consistency, the floating-point code has been used as a reference and was scaled to 100%.

For these kernels, the C6x compiler was obviously able to generate very efficient code. For consistency, we have measured the cycle count including the function call. This causes the hand-optimized C code to be faster than the hand-optimized assembly code for some kernels. The floating-point code is slower than the hand-optimized assembly and C code in all cases, as the floating-point instructions need more execution stages than their integer counterparts. For this set of kernel functions, the FRIDGE-generated code consumes more cycles than the hand-optimized code, as additional shift and cast operations for overflow protection are performed. For some kernels, such as MAC VSELP and VEC SUM, this leads to a significant overhead, as the hand-optimized code uses the processor's functional units in a very efficient manner. Introducing additional shift and bit-mask operations in the innermost loop slows down the code, as no unused functional units are available in the very tight loop pipelining schedule. The S-unit in particular, which performs shift operations, is heavily used and becomes the performance bottleneck. Nevertheless, the FRIDGE-generated code comes very close in performance to the hand-optimized code while offering full overflow protection and maintaining consistency of input and output data formats.



11. SUMMARY

The FRIDGE design environment presented in this article allows the designer to concentrate on the critical issues of the floating-point to fixed-point design flow, and thus to explore the design space more efficiently. The interpolative transformation, which is based on analytical range propagation, enables an accelerated development cycle and consequently a shorter time-to-market.

The fast simulation code generation as well as the DSP back end benefit directly from the advanced control and data flow analysis techniques we developed. The concept of abstract execution, in combination with a state-driven memory model and coupled iterators, yields results with the precision necessary for the back-end transformation steps.

The verification of the fixed-point algorithm has to be performed by means of simulation. Existing C++-based fixed-point libraries increase simulation time by up to two orders of magnitude compared to the corresponding floating-point simulation. The FRIDGE fast simulation back end applies advanced compile-time analysis concepts, analyzes the necessary casting operations, and selects the appropriate built-in data type on the host machine. In this way, a speedup by a factor of 20 to 400 compared to the SystemC code was achieved while maintaining bit-by-bit equivalence.
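The idea behind this speedup can be sketched as follows (the formats and names here are illustrative assumptions; the real code is produced automatically by the back end):

    #include <stdint.h>

    /* A signal proven by compile-time analysis to fit a 12-bit two's
       complement grid is carried in a native 32-bit integer.  A cast
       back to the grid is emitted only where the analysis cannot rule
       out overflow; everywhere else the operation runs at native
       speed.  (The shift idiom assumes the usual two's complement
       behavior of the host machine.) */
    static int32_t quantize12(int32_t v)
    {
        return (int32_t)((uint32_t)v << 20) >> 20;  /* keep low 12 bits, signed */
    }

    int32_t filter_step(int32_t state, int32_t in)
    {
        int32_t t = state + in;    /* no cast: the sum provably fits */
        return quantize12(3 * t);  /* cast only where it can matter  */
    }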

The target-specific C code generation provides a direct link from floating-point code to C62x C code using integral data types. The generated code yields bit-by-bit the same results as the bit-true SystemC code for host simulation, enabling comparative simulation against the reference model. As shown by the experimental data, the generated C62x C code comes very close to hand-optimized C and assembly code.

These features make FRIDGE a powerful design environment for the specification, evaluation, and implementation of fixed-point algorithms.

REFERENCES

[1] Synopsys Inc., "CoCentric System Studio—User's Manual," Mountain View, Calif, USA.

[2] Mathworks Inc., "Simulink Reference Manual," March 1996.

[3] Cadence Design Systems, "SPW User's Manual," 919 E. Hillsdale Blvd., Foster City, Calif, USA.

[4] T. Grotker, E. Multhaup, and O. Mauss, "Evaluation of HW/SW tradeoffs using behavioral synthesis," in Proc. Int. Conf. on Signal Processing Application and Technology, Boston, Mass, USA, October 1996.

[5] B. Liu, "Effect of finite word length on the accuracy of digital filters—a review," IEEE Trans. on Circuit Theory, vol. 18, no. 6, pp. 670–677, 1971.

[6] H. Keding, M. Willems, M. Coors, and H. Meyr, "FRIDGE: A fixed-point design and simulation environment," in Proc. European Conference on Design, Automation and Test, pp. 429–435, Paris, France, February 1998.

[7] M. Willems, V. Bursgens, and H. Meyr, "FRIDGE: Floating-point programming of fixed-point digital signal processors," in Proc. Int. Conf. on Signal Processing Application and Technology, pp. 1000–1005, San Diego, Calif, USA, September 1997.

[8] M. Willems, V. Bursgens, H. Keding, T. Grotker, and H. Meyr, "System level fixed-point design based on an interpolative approach," in Proc. Design Automation Conference, pp. 293–298, Anaheim, Calif, USA, June 1997.

[9] S. Kim, K. Kum, and W. Sung, "Fixed-point optimization utility for C and C++ based digital signal processing programs," in Workshop on VLSI and Signal Processing '95, pp. 197–206, Osaka, Japan, November 1995.

[10] Frontier Design Inc., "A|RT Library User's and Reference Documentation," Danville, Calif, USA, 1998.

[11] Synopsys Inc., CoWare Inc., and Frontier Design Inc., "SystemC User's Guide, Version 2.0," 2001.

[12] W. Sung and K. Kum, "Word-length determination and scaling software for a signal flow block diagram," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, pp. 457–460, Adelaide, Australia, April 1994.

[13] B. W. Kernighan and D. M. Ritchie, The C Programming Language, Prentice-Hall, Englewood Cliffs, NJ, USA, 2nd edition, 1988.

[14] M. Willems, A Methodology for the Efficient Design of Fixed-Point Systems, Ph.D. thesis, Aachen University of Technology, 1998.

[15] Mentor Graphics, "DSP Station User's Manual," San Jose, Calif, USA.

[16] C. Hankin, "Program analysis tools," International Journal on Software Tools for Technology Transfer, vol. 2, no. 1, pp. 6–12, 1998.

[17] A. Aho, R. Sethi, and J. Ullman, Compilers: Principles, Techniques and Tools, Addison-Wesley, Reading, Mass, USA, 1986.

[18] M. J. Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley, Redwood City, Calif, USA, 1996.

[19] C. Hankin, F. Nielson, and H. R. Nielson, Principles of Program Analysis, Springer, Heidelberg, Germany, 1999.

[20] F. Martin, "PAG—an efficient program analyzer generator," International Journal on Software Tools for Technology Transfer, vol. 2, no. 1, pp. 46–67, 1998.

[21] MIPS Computer Systems, "UMIPS-V Reference Manual (Pixie and Pixstats)," Sunnyvale, Calif, USA, 1990.

[22] T. Ball and J. R. Larus, "Optimally profiling and tracing programs," ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 16, no. 4, pp. 1319–1360, 1994.

[23] S. B. Akers, "Binary decision diagrams," IEEE Trans. on Computers, vol. 27, no. 6, pp. 509–516, 1978.

[24] European Telecommunication Standard Institute, "GSM full rate speech transcoding," GSM recommendation 06.10, February 1992.

[25] S. Kim, K. Kum, and W. Sung, "Fixed-point optimization utility for C and C++ based digital signal processing programs," IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 45, no. 11, pp. 1455–1464, 1998.

[26] L. De Coster, Bit-True Simulation of Digital Signal Processing Applications, Ph.D. thesis, KU Leuven, 1999.

[27] Mentor Graphics, "DSP Architect, DFL User's and Reference Manual," 1994.

[28] K. Kum, J. Kang, and W. Sung, "A floating-point to integer C converter with shift reduction for fixed-point digital signal processors," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. 4, pp. 2163–2166, Phoenix, Ariz, USA, March 1999.

[29] H. Keding, M. Coors, O. Luthje, and H. Meyr, "Fast bit-true simulation," in Proc. Design Automation Conference, pp. 708–713, Las Vegas, Nev, USA, June 2001.

[30] S. K. Mitra, Digital Signal Processing: A Computer-Based Approach, McGraw-Hill, New York, NY, USA, 1998.

[31] V. Zivojnovic, J. Martinez, C. Schlager, and H. Meyr, "DSPstone: A DSP-oriented benchmarking methodology," in Proc. International Conference on Signal Processing Applications and Technology, Dallas, Tex, USA, October 1994.


[32] M. Coors, O. Wahlen, H. Keding, O. Luthje, and H. Meyr, "C62x compiler benchmarking and performance coding techniques," in Proc. International Conference on Signal Processing Applications and Technology, Orlando, Fla, USA, November 1999.

[33] Texas Instruments, USA, "TMS320C6000 Optimizing Compiler User's Guide," March 2000.

Martin Coors received the diploma in electrical engineering from Aachen University of Technology (RWTH), Aachen, Germany. In 1997, he joined the Institute for Integrated Signal Processing Systems (ISS) at RWTH Aachen as a research assistant. His research interests include DSP code optimization techniques, fixed-point design methodologies, and code generation for embedded processors.

Olaf Luthje received the diploma in electrical engineering from Aachen University of Technology (RWTH), Aachen, Germany, and is currently working towards the Ph.D. degree in electrical engineering at the same institute. His research interests focus on fixed-point design methodology and data flow analysis.

Holger Keding received the diploma in electrical engineering from Aachen University of Technology (RWTH), Aachen, Germany. From 1996 to 2001 he was with the ISS, working towards his Ph.D. After finishing his Ph.D., he joined the system-level design group of Synopsys as a senior corporate application engineer. His research interests include fast bit-true simulation and fixed-point and system-level design methodology.

Heinrich Meyr received his M.S. and Ph.D. from ETH Zurich, Switzerland. He spent over 12 years in various research and management positions in industry before accepting a professorship in electrical engineering at Aachen University of Technology (RWTH Aachen) in 1977. He has worked extensively in the areas of communication theory, synchronization, and digital signal processing for the last thirty years. His research has been applied to the design of many industrial products. At RWTH Aachen he heads an institute involved in the analysis and design of complex signal processing systems for communication applications. He was a cofounder of CADIS GmbH (acquired in 1993 by Synopsys, Mountain View, California), a company which commercialized the tool suite COSSAP, extensively used worldwide in industry. He is a member of the board of directors of two companies in the communications industry. Dr. Meyr has published numerous IEEE papers. He is the author, together with Dr. G. Ascheid, of the book "Synchronization in Digital Communications," Wiley, 1990, and, together with Dr. M. Moeneclaey and Dr. S. Fechtel, of the book "Digital Communication Receivers: Synchronization, Channel Estimation, and Signal Processing," Wiley, October 1997. He holds many patents. He served as Vice President for International Affairs of the IEEE Communications Society and is a Fellow of the IEEE.


EURASIP Journal on Applied Signal Processing 2002:9, 926–935
© 2002 Hindawi Publishing Corporation

Partitioning and Scheduling DSP Applications with Maximal Memory Access Hiding

Zhong Wang
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
Email: [email protected]

Edwin Hsing-Mean Sha
Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, USA
Email: [email protected]

Yuke Wang
Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, USA
Email: [email protected]

Received 2 September 2001 and in revised form 14 May 2002

This paper presents an iteration space partitioning scheme to reduce the CPU idle time due to long memory access latency. We take into consideration the data accesses of both intermediate and initial data. An algorithm is proposed to find the largest overlap for initial data in order to reduce the overall memory traffic. To hide the memory latency efficiently, another algorithm is developed to balance the ALU and memory schedules. Experiments on DSP benchmarks show that the algorithms significantly outperform the known existing methods.

Keywords and phrases: loop pipelining, initial data, maximal overlap, balanced partition scheduling.

1. INTRODUCTION

Contemporary DSP and embedded systems usually contain a memory hierarchy, which can be categorized into on-chip and off-chip memories. In general, the on-chip memory is fast but of restricted size, while the off-chip memory is much slower but larger. For the CPU's computation, data need to be loaded from the off-chip to the on-chip memory, so system performance is degraded by the long off-chip access latency. How to tolerate the memory latency within a memory hierarchy is becoming an increasingly important problem [1]. In this paper, the on-chip and off-chip memories are abstracted as the first- and second-level memories, respectively.

Prefetching [1, 2, 3, 4, 5] is a technique that fetches data from memory in advance of the corresponding computations; it can be used to hide the memory latency. On the other hand, software pipelining [6] and modulo scheduling [7, 8] are scheduling techniques used to exploit the parallelism in a loop. Both prefetching and scheduling techniques can be used to accelerate execution. However, these traditional techniques have weaknesses [9] that prevent them from efficiently solving the problem mentioned in the first paragraph. This paper combines the software pipelining technique with the data prefetching approach. Multiple memory units, attached to the first-level memory, perform operations to prefetch data from the second-level to the first-level memory. These memory units are in charge of preparing all data required by the computation in the first-level memory ahead of the computation. Multiple ALU units exist in the processor for the computation. The ALU schedule is optimized by using the software pipelining technique under the resource constraints. The operations in the ALU units and memory units execute simultaneously; therefore, the long memory access latency is tolerated by overlapping the data fetching operations with the ALU operations. Although using computation to hide memory latency has been studied extensively, balancing the computation and memory loading has, to the authors' knowledge, never been researched thoroughly. This paper presents an approach to balance the ALU and memory schedules to achieve an optimal overall schedule length.

The data to be prefetched can be classified into two groups: intermediate and initial data. Intermediate data can serve as both left and right operands in the equations.



Their values vary during the computation. On the contrary, initial data can only serve as right operands in the equations; they keep their values during the computation. Take the following equations as an example: the arrays B and C are intermediate data and A is initial data,

B[i + 1] = B[i] * B[i - 1] + A[i],
C[i + 1] = B[i - 1] * A[i + 1] + A[i].      (1)

The influence of both kinds of data should be considered in order to obtain an optimal overall schedule.
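As a minimal sketch, recurrence (1) written as a loop body makes the classification visible:

    /* The example recurrence (1): B and C are intermediate data (both
       written and read), while A is initial data (read only, value
       unchanged throughout the computation). */
    void example(const double *A, double *B, double *C, int N)
    {
        int i;
        for (i = 1; i < N - 1; i++) {
            B[i + 1] = B[i] * B[i - 1] + A[i];
            C[i + 1] = B[i - 1] * A[i + 1] + A[i];
        }
    }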

To take full advantage of data locality, the entire iteration space can be divided into small blocks called partitions. Much work has been done on partitioning techniques. Loop tiling [10, 11] is a technique used to group basic computations so as to increase computation granularity and thereby reduce communication time. Generally, these approaches have no detailed schedule of ALU and memory operations as our method does; moreover, only intermediate data are taken into consideration. Agarwal and Kranz [12] make an extensive study of data partitioning. They use an approximation method to find a good partition that minimizes the data transfer among different processors, and affine reference indices are considered in their work. However, they mainly concentrate on the initial data and give little consideration to the intermediate data.

The approaches in [9, 13] are among the few that consider the detailed schedule under a memory hierarchy. Nevertheless, their memory references cover only the intermediate data and ignore the initial data, which are an important factor in performance. From the experimental results in Section 5, we can see that this deficiency leads to an unbalanced, and hence worse, schedule.

In our approach, both the intermediate and initial data are considered. For the intermediate data, we restrict our study to nested loops with uniform data dependencies. The study of uniform loop nests is justified by the fact that most general linear recurrence equations can be transformed into a uniform form. This transformation (uniformization [14]) greatly reduces the complexity of the problem. On the other hand, it is difficult to apply uniformization to the initial data; therefore, affine reference indices are considered. The concept of a footprint [12] is used to denote the initial data needed for the computation of the ALU units in one partition. Given a partition shape, this paper presents an algorithm to find a partition size which gives rise to the maximum overlap between adjacent overall footprints, such that the number of memory operations is reduced to the largest extent.

When considering the schedule of the loop, we propose detailed ALU and memory schedules. Each memory and ALU operation is assigned to an available hardware unit and time slot; therefore, it is very convenient to apply our technique in a compiler. The memory schedule is balanced against the ALU schedule such that the overall schedule is close to the lower bound, which is determined by the ALU schedule. Our method gives the algorithm to determine the partition shape and size in order to achieve balanced ALU and memory schedules. Finally, the memory requirement of our technique is also presented.

The new algorithm in this paper significantly exceeds the performance of the existing algorithms [9, 13] because it optimizes both ALU and memory schedules and considers the influence of initial data. Taking the wave digital filter as an example, in a standard system with 4 ALU units and 4 memory units, and assuming 3 initial data references in each iteration, our algorithm obtains an average schedule length of 4.018 CPU clock cycles, which is very close to the theoretical lower bound of 4 clock cycles. Traditional list scheduling needs 22 clock cycles, and hardware prefetching costs 10 clock cycles. While the PSP algorithm in [13] achieves some improvement, it still needs 8 clock cycles; without the memory constraint, the algorithm in [9] has the same performance, 8 clock cycles. Our algorithm improves on all the previous approaches.

It is worthwhile to mention that some work has been done on data layout techniques [15, 16], which are used to maintain cache coherency and reduce conflict traffic. Our work should be regarded as a different layer, which can be built upon the data layout layer to obtain better performance.

The remainder of this paper is organized as follows. Section 2 introduces the terms and basic concepts used in the paper. Section 3 presents the theory on initial data. Section 4 describes the algorithm to find the detailed schedule. Section 5 contains the experimental comparison of this technique with a number of existing approaches. We conclude in Section 6.

2. BACKGROUND

We can represent the operations in a loop by a multidimensional data flow graph (MDFG) [6]. Each node in the MDFG represents a computation. Each edge denotes the data dependence between two computations, with its weight as the distance vector. The benefit of using an MDFG instead of the general data dependence graph (DDG) or statement dependence graph (SDG) is that the MDFG is a finer-grained description of data dependences. Each node of an MDFG corresponds to one ALU computation. On the contrary, a node in a DDG or SDG always corresponds to a statement, which consumes an uncertain amount of ALU computation time depending on the complexity of the statement. It is therefore more convenient to schedule the ALU operations with an MDFG. Moreover, many DSP applications, such as DSP filters, can be directly mapped into an MDFG [17].

The execution of all nodes in an MDFG one time is an iteration. It corresponds to executing the loop body once under a certain loop index. Iterations are identified by a vector i, equivalent to a multidimensional index.

In this paper, we illustrate our ideas with two-dimensional loops. It is not difficult to extend them to loops with more than two dimensions by using the same ideas presented in this paper.



[Figure omitted: multiple ALUs (ALU 1 to ALU 3) and multiple memory units (mem 1 to mem 3) are connected to a small, fast internal (first-level) memory, which in turn connects to a large, slow external (second-level) memory.]
Figure 1: Architecture model with multiple function units and a memory hierarchy.

2.1. Architecture model

The technique in our paper is designed for use in a system which has one or more processors. These processors share a common memory hierarchy, as shown in Figure 1. There are multiple ALU and memory units in the system. The access time for the first-level memory is significantly less than for the second-level memory, as in current systems. During a program's execution, if an instruction requires data which are not in the first-level memory, the processor has to fetch the data from the second-level memory, which costs much more time. Thus, prefetching data into the first-level memory before their explicit use can minimize the overall execution time. Two types of memory operations, prefetch and keep, are supported by the memory units. The prefetch operation prefetches data from the second-level to the first-level memory; the keep operation keeps data in the first-level memory for the execution of one partition. Both are issued to guarantee that data referenced in the near future appear in the first-level memory before their references. It is important to note that the first-level memory in this model cannot be regarded as a pure cache, because we do not consider cache associativity; in other words, it can be thought of as a fully associative cache.

2.2. Partitioning the iteration space

Regular execution of nested loops proceeds in either a row-wise or column-wise manner until the boundary of the iteration space is reached. However, this mode of execution does not take full advantage of either the locality of reference or the available parallelism. The execution of such structures can be made more efficient by dividing the entire iteration space into regions called partitions that better exploit spatial locality.

Provided that the total iteration space is divided into partitions of iterations, the execution sequence is determined partition by partition. Assume that the partition in which the loop is currently executing is the current partition. Then the next partition is the partition adjacent to the current partition on the right side, along the x-axis. The other partitions are all partitions except these two. Based on this classification, different memory operations are assigned to different data in a partition. For a delay dependency that goes into the next partition, a keep memory operation is used to keep the data in the first-level memory for one partition, since the data will be reused immediately in the next partition. Delay dependencies that go into other partitions result in the use of prefetch memory operations to fetch the data in advance.

[Figure omitted: an overall schedule over control steps CS 1 to CS 17; the memory part holds the prefetch and keep operations for intermediate and initial data, while the ALU part holds the computations of the twelve iterations in the partition.]
Figure 2: The overall schedule.

A partition is determined by its partition shape and partition size. We use two basic vectors (in a basic vector, each element is an integer and the elements have no common factor except 1), Px and Py, to identify a parallelogram as the partition shape. These two basic vectors are called partition vectors. Assume, without loss of generality, that the angle between Px and Py is less than 180° and that Px is clockwise of Py. The partition size is determined by the vector S = (fx, fy), where fx and fy are the multiples of the partition size over the partition vectors Px and Py, respectively. Thus, the partition is delimited by the two vectors fxPx and fyPy.

How to find the optimal partition size will be discussed in Section 4. Due to the dependencies between the iterations, Px and Py cannot be chosen arbitrarily. The following property gives the condition for a legal partition shape [9].

Property 1. A pair of partition vectors that satisfies the following constraints is legal: for each delay vector de, the cross products¹ satisfy de × Px ≤ 0 and de × Py ≥ 0.

Because nested loops should follow the lexicographical order, we can choose (1, 0) as our Px vector and use the normalized leftmost vector of all delay dependencies as our Py. The partition shape is decided by these two vectors.
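Property 1 translates directly into a legality check; the following is a minimal sketch using the cross product defined in the footnote:

    /* Check Property 1: (Px, Py) is a legal partition shape iff, for
       every delay vector de, de x Px <= 0 and de x Py >= 0. */
    typedef struct { int x, y; } Vec2;

    static int cross(Vec2 p1, Vec2 p2)          /* 2D cross product */
    {
        return p1.x * p2.y - p1.y * p2.x;
    }

    int legal_partition(Vec2 Px, Vec2 Py, const Vec2 *de, int nde)
    {
        int i;
        for (i = 0; i < nde; i++)
            if (cross(de[i], Px) > 0 || cross(de[i], Py) < 0)
                return 0;                       /* shape is illegal */
        return 1;
    }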

An overall schedule consists of two parts, an ALU part and a memory part, as seen in Figure 2. The ALU part schedules the ALU computation. We know that the computation in a loop can be represented by an MDFG; the ALU part is a schedule of these MDFG nodes. The memory part schedules the memory operations, prefetch and keep, so that the data for the computation can always be found in the first-level memory.

¹ The cross product p1 × p2 is defined as the signed area of the parallelogram formed by the points (0, 0), p1, p2, and p1 + p2 = (x1 + x2, y1 + y2); it is p1 × p2 = (p1 · x)(p2 · y) − (p1 · y)(p2 · x).

[Figure omitted: the footprints of the references B(i + j, i − j) and B(2i + j, i − 2j) for a 3 × 4 partition in the (i, j) plane.]
Figure 3: The footprint.

3. THE THEORY ABOUT INITIAL DATA

The overall footprint of one partition consists of all the initial data needed by the computation of one partition. Provided the execution follows the partition sequence, the initial data needed by the current partition's computation were prefetched to the first-level memory during the previous partition. Likewise, the initial data needed by the next partition's execution are prefetched by the memory units during the current partition's execution. The data in the overlap between the overall footprints of the current and next partitions are already in the first-level memory, so their prefetch operations can be spared. Thus, the major concern for the initial data is how to maximize the overlap between the overall footprints of two consecutively executed partitions, so as to reduce the memory traffic.

As mentioned in Section 1, we consider affine references for the initial data. Given a loop index vector i, an affine reference index can be expressed as g(i) = iG + a, where G = [G1 G2] is a 2 × 2 matrix and a is the offset vector. The footprint with respect to a reference A[g(i)] is the set of all data elements A[g(i)] of A, for i an element of the partition. The overall footprint is the union of the footprints with respect to all different references. For example, in Figure 3, the partition is a rectangle of size 3 × 4. The initial data references are B(i + j, i − j) and B(2i + j, i − 2j); their corresponding footprints are the integer points marked × and •, respectively. The overall footprint is the union of these two footprints.

In [12], Agarwal et al. present the concept of uniformly generated references. Two references A[g1(i)] and A[g2(i)] are said to be uniformly generated if

g1(i) = iG + a1,    g2(i) = iG + a2.      (2)

If two references B1 and B2 are not uniformly generated, the overlap between the footprint with respect to B1 of the current partition and that with respect to B2 of the next partition can be ignored, because the overlap, if it exists, diminishes rapidly. Therefore, we need only consider the overlap between footprints with respect to uniformly generated references of two consecutive partitions. Moreover, the offset vector a should satisfy a = m·G1 + n·G2, where m and n are integer constants; otherwise, no overlap between the footprints of consecutive partitions exists even for uniformly generated references.

The memory requirement should be taken into account when trying to maximize the overlap. The partition size cannot be enlarged arbitrarily merely to increase the overlap: a larger partition means a larger overall footprint, that is, much more memory space is consumed. Therefore, given a partition shape and a set of uniformly generated references, we derive conditions on the partition size which must be met to achieve a reasonable maximal overlap. For convenience of description, we introduce the following notation.

Definition 1. (1) Assuming the partition size is S, f(a, S) is the footprint with respect to the reference with offset vector a of the current partition, and f(a′, S) is the footprint with respect to the reference with offset a of the next partition.

(2) Given a set of uniformly generated references, the set R = {a1, a2, . . . , an} is the set of offset vectors.² Assuming the partition size is S, F(R, S) is the overall footprint of the current partition and F(R′, S) is the overall footprint of the next partition.

The one-dimensional case can be regarded as a simplification of the two-dimensional problem, in which fy is always set to zero; it provides the theoretical foundation for the two-dimensional problem. In the one-dimensional case, a partition reduces to a line segment and all vectors reduce to integers. The partition size can be thought of as the length of the line segment. We use an example to demonstrate the problem we are tackling. In Figure 4, there are three different offset vectors: 1, 2, 7. The solid lines represent the overall footprint of the current partition, and the dotted lines denote that of the next partition. We need to find the condition on the partition size, that is, the length of the line segment, which achieves maximal overlap. The figure shows the case of length 5, which is the minimum length obtaining the maximum overlap between the overall footprints.

In order to derive the theorem on the minimum value S which generates the maximum overlap, we first state the following lemmas.

² Note that the elements in the set R are in lexicographically increasing order.

Page 60: Implementation of DSP and Communication Systemsdownloads.hindawi.com/journals/specialissues/434718.pdf · ing society (1996–1998), a board member at IEEE neural network council,


[Figure omitted: a number line from 0 to 18 showing the one-dimensional footprints of the current (solid) and next (dotted) partition.]
Figure 4: One-dimensional line segments.

[Figure omitted: two cases of the relative positions of two segments of length S starting at a1 and a2: (a) Case 1, a1 + S ≤ a2; (b) Case 2, a1 + S > a2.]
Figure 5: Two different relations between a1 and a2.

The lemmas consider the overlap of two footprints of consecutive partitions, as shown in Figure 5. The solid line is the footprint of the current partition and the dotted line is the footprint of the next partition.

Lemma 1. The minimum S which makes the intersection between f(a1′, S) and f(a2, S) maximum is S = a2 − a1, where a2 ≥ a1.

Proof. According to the relation between a1 + S and a2, there are two cases.

Case 1. As shown in Figure 5a, a1 + S ≤ a2, that is, S ≤ a2 − a1. The intersection is (a2, a1 + 2S − 1); it reaches its maximum size a2 − a1 when S = a2 − a1.

Case 2. As shown in Figure 5b, a1 + S > a2, that is, S > a2 − a1. The intersection of the two segments is (a1 + S, a2 + S − 1), which does not depend on S; the size of the intersection does not increase with any further increment of S.

Lemma 2. The intersection between f(a1′, S) and f(a2, S), where a2 ≥ a1, remains constant, irrespective of the value of S, as long as S ≥ a2 − a1.

According to Definition 1, F(R, S) and F(R′, S) can be expressed as

F(R, S) = f(a1, S) ∪ f(a2, S) ∪ · · · ∪ f(an, S),
F(R′, S) = f(a1′, S) ∪ f(a2′, S) ∪ · · · ∪ f(an′, S).      (3)

The following lemma gives the expression of their intersection.

Lemma 3. Let Cm be the intersection f(am, S) ∩ f(a′m−1, S). Then the intersection of F(R, S) and F(R′, S) is C2 ∪ C3 ∪ · · · ∪ Cn, where n is the number of integers in R.

Proof. Let Am denote f(am, S) and Bm denote f(am′, S).

Basis step. Let n = 2. Then F(R, S) = A1 ∪ A2 and F(R′, S) = B1 ∪ B2. The ending point of A1 is less than the starting points of B1 and B2, and the starting point of B2 is greater than the ending points of A1 and A2. Thus, the only possible intersection is A2 ∩ B1.

Induction hypothesis. Assume that, for some n ≥ 2, F(R, S) ∩ F(R′, S) = C2 ∪ C3 ∪ · · · ∪ Cn.

Induction step. For n + 1, the added intersection is An+1 ∩ (B1 ∪ B2 ∪ · · · ∪ Bn). There are two cases.

(1) an+1 ≥ an + S. Then An+1 can only intersect Bn.

(2) an+1 < an + S. Then An+1 can be divided into two parts, A′ = (an+1, an + S) and A′′ = (an + S, an+1 + S − 1), and

An+1 ∩ (B1 ∪ B2 ∪ · · · ∪ Bn)
  = (A′ ∩ (B1 ∪ B2 ∪ · · · ∪ Bn)) ∪ (A′′ ∩ (B1 ∪ B2 ∪ · · · ∪ Bn))
  ⊆ (C2 ∪ · · · ∪ Cn) ∪ (An+1 ∩ Bn),      (4)

with An+1 ∩ Bn = Cn+1. Therefore, F(R, S) ∩ F(R′, S) = C2 ∪ C3 ∪ · · · ∪ Cn+1.

Theorem 1. Given the set R = {a1, a2, a3, . . . , an}, the maximum intersection between F(R, S) and F(R′, S) is achieved when S = max over 2 ≤ m ≤ n of (am − am−1).

Proof. Consider two adjacent intersections Cm and Cm−1, with Cm = Am ∩ Bm−1 and Cm−1 = Am−1 ∩ Bm−2. There is no common element between Bm−1 and Am−1, and hence none between Cm and Cm−1. According to Lemmas 1 and 2, any value S ≥ am − am−1 makes the segment Cm largest. Moreover, the Cm do not intersect each other. Therefore, the theorem holds.
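In code, Theorem 1 amounts to taking the largest gap between adjacent offsets (a minimal sketch):

    /* Smallest partition length S maximizing the overlap of the
       overall footprints of consecutive partitions, given the offsets
       a[0..n-1] of a uniformly generated set in increasing order. */
    int min_partition_length(const int *a, int n)
    {
        int m, S = 0;
        for (m = 1; m < n; m++)
            if (a[m] - a[m - 1] > S)
                S = a[m] - a[m - 1];
        return S;
    }

For the offsets {1, 2, 7} of Figure 4 this yields max(1, 5) = 5, the minimum length shown there.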

From Theorem 1 and Lemma 2, we can directly derive the following theorem.

Theorem 2. For the overall footprints F(R, S) and F(R′, S), the overlap remains constant if the value of S is increased beyond the value obtained from Theorem 1.

To maximize the overlap between F(R, S) and F(R′, S) in the two-dimensional space, we observe that the fy element of the partition size is not as critical as the fx element, since the intersection always increases when fy is enlarged; we determine the value of fy from other conditions. The key question is therefore the minimum value of fx that makes the intersection maximum, given a certain fy.

Next, we discuss the situation where G is the two-dimensional identity matrix.

Page 61: Implementation of DSP and Communication Systemsdownloads.hindawi.com/journals/specialissues/434718.pdf · ing society (1996–1998), a board member at IEEE neural network council,


[Figure omitted: the overall footprint of one partition divided into 7 numbered horizontal stripes.]
Figure 6: The stripe division of a footprint.

If G is not an identity matrix, the same idea can be applied as long as a = m·G1 + n·G2; the only difference is that the original XY-space is transformed into the new space by the matrix G. An augmented set R* can be obtained from a given partition size S and the set R as follows: a*_i = a_i and a*_{i+n} = a_i + (0, fy(Py · y)), where n is the size of the set R and Py = (Py · x, Py · y). Arranging all the points of R* in increasing order of their y element, the overall footprint of one partition can be divided into a series of stripes, each delimited by the two horizontal lines passing through two adjacent points of the sorted R*. For instance, in Figure 6, the set R is {(0, 0), (6, 1), (3, 2), (1, 3)}. Assume the value of fy(Py · y) is 5; then the augmented set R* is {(0, 0), (0, 5), (6, 1), (6, 6), (3, 2), (3, 7), (1, 3), (1, 8)}. After sorting, it becomes {(0, 0), (6, 1), (3, 2), (1, 3), (0, 5), (6, 6), (3, 7), (1, 8)}. The overall footprint consists of 7 stripes, as indicated in Figure 6.

In each stripe, a horizontal line intersects the left bounds of some footprints f(a, S). Thus, the two-dimensional intersection problem within this stripe can be reduced to the one-dimensional problem, which is solved using Theorem 1. Applying this idea to each stripe solves the two-dimensional overlap problem, as demonstrated in Algorithm 1. The algorithm clearly runs in polynomial time; its time complexity is O(n²).

From Lemma 2, the intersection stays constant if fx is greater than the value chosen by this algorithm, and shrinks for smaller fx. We demonstrate this phenomenon with two examples. The set R for the first example is {(0, 1), (5, 3), (−3, 1), (4, −1), (−2, −2)} and the partition shape is (1, 0) × (0, 1); it is the partition shape for the wave digital filter. The set R for the second example is {(0, 2), (3, 5), (1, 3), (−1, −1)} and the partition shape is (1, 0) × (−3, 1); it is the partition shape for the two-dimensional filter. Figures 7a and 7b show how the footprint intersection varies with fx and fy for the two examples, respectively.

4. THE OVERALL SCHEDULE

The overall schedule can be divided into two parts, the ALU and memory schedules.

Input: the set R and the shape of the partition.
Output: the fx which makes the overlap maximum under a certain fy.

(1) Set fx to 0.
(2) Based on the set R and the partition shape, choose an fy such that the product fy(Py · y) is larger than the difference between the largest and smallest y elements of all vectors in the set R.
(3) Using the fy above, generate the augmented set R*.
(4) Sort all points of R* in increasing order of their y element and keep them in an event list.
(5) Use a horizontal line to sweep the whole iteration space. When an event point is met, insert the corresponding footprint f(a, S) into a visiting list if the event point is the lower bound of the footprint; otherwise, delete the corresponding f(a, S) from the list.
(6) Calculate the intersection points of this line with the left and right bounds of each footprint in the visiting list. Use Theorem 1 to derive a value fx′ which makes the intersection in the current stripe maximal.
(7) Replace fx with fx′ if fx′ > fx.

Algorithm 1: Calculating the minimum fx which makes the overlap maximum.
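A simplified rendering of the sweep in Algorithm 1 (a sketch assuming G = I; the naive active list used here gives O(n² log n), whereas a careful implementation attains the O(n²) bound stated above):

    #include <stdlib.h>
    #include <string.h>

    typedef struct { int x, y, lower; } Event;  /* lower = 1: footprint enters */

    static int cmp_int(const void *p, const void *q)
    { return *(const int *)p - *(const int *)q; }

    static int cmp_event_y(const void *p, const void *q)
    { return ((const Event *)p)->y - ((const Event *)q)->y; }

    /* Sweep the 2n events of the augmented set R* bottom-up; in each
       stripe apply Theorem 1 to the x coordinates of the footprints
       currently cut by the sweep line, and keep the largest gap. */
    int algorithm1_fx(Event *ev, int m)         /* m = 2n events */
    {
        int *active = malloc(m * sizeof *active);
        int *stripe = malloc(m * sizeof *stripe);
        int i, j, na = 0, fx = 0;

        qsort(ev, m, sizeof *ev, cmp_event_y);
        for (i = 0; i < m; i++) {
            if (ev[i].lower) {
                active[na++] = ev[i].x;         /* footprint enters */
            } else {
                for (j = 0; j < na; j++)        /* footprint leaves */
                    if (active[j] == ev[i].x) { active[j] = active[--na]; break; }
            }
            memcpy(stripe, active, na * sizeof *stripe);
            qsort(stripe, na, sizeof *stripe, cmp_int);
            for (j = 1; j < na; j++)            /* Theorem 1 per stripe */
                if (stripe[j] - stripe[j - 1] > fx)
                    fx = stripe[j] - stripe[j - 1];
        }
        free(active);
        free(stripe);
        return fx;
    }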

For the ALU schedule, the multidimensional rotation scheduling algorithm [6] is used to generate a static schedule for one iteration. The entire ALU schedule can then be formed by simply replicating this schedule for each iteration in the partition. The schedule obtained in this way is the most compact schedule, since it considers only the ALU hardware resource constraints; the overall schedule length must be at least as long. Thus, this ALU schedule provides a lower bound for the overall schedule. This lower bound can be calculated as len_iteration × #nodes, where len_iteration is the schedule length obtained by the multidimensional rotation scheduling algorithm for one iteration and #nodes denotes the number of iteration nodes in one partition. Our objective is to find a partition whose overall schedule length is very close to this lower bound.

4.1. Balanced overall schedule

Unlike the ALU schedule, the memory schedule is treated as a whole for the entire partition. It consists of two parts: memory operations for initial data and memory operations for intermediate data. Each part consists of the prefetch and keep operations for the corresponding data. Because the prefetch operations have no relation to the current computation, they can be arranged from the beginning of the memory schedule. On the contrary, a keep operation for intermediate data can only be issued after the corresponding computation has finished; the keep operations for initial data can be issued as soon as the data have been prefetched. The memory schedule length is the sum of these two parts' schedule lengths.

[Figure omitted: footprint intersection as a function of fx for several values of fy; panel (a) 2D filter (fy = 3, 5, 6, 8), panel (b) WDF (fy = 3, 5, 7).]
Figure 7: The tendency of the intersection with fx and fy.

For the intermediate data, the calculation of the number of prefetch and keep operations can follow [13]. The initial data can be prefetched in blocks; this kind of operation fetches several data at one time and costs only a little more than a general prefetch operation. To calculate the number of such operations, we first make the following observation.

Property 2. As long as fy(PyG2), the projection of the footprint size along the direction G2, is larger than the maximum difference of aG2 over all offset vectors a belonging to a uniformly generated offset vector set, the overall footprint increases at a constant rate with the increment of fy, and so does the number of prefetch operations for initial data.

Note that the requirement in the above property guarantees that the partition is large enough that the footprint with respect to one offset vector intersects the footprints with respect to all other offset vectors belonging to the same uniformly generated set.

Suppose that a two-dimensional vector is written as a = (a · x, a · y). Given a certain fx, the number of prefetch operations for initial data, for any fy satisfying the condition in the above property, is PreBase_ini + (fy − fy0) × Preincr_ini, where fy0 = ⌈y0 / ((PyG) · y)⌉, y0 is the maximum difference of (aG) · y over all offset vectors, PreBase_ini denotes the number of such operations for a partition of size fx × fy0, and Preincr_ini is the increment in the number of prefetch operations when fy is increased by one.

The keep operations for the initial data can be issued after the data have been prefetched. The number of such keep operations is KeepBase_ini + (fy − fy0) × Keepincr_ini, where y0 and fy0 have the same meaning as above, KeepBase_ini denotes the number of keep operations for a partition of size fx × fy0, and Keepincr_ini is the increment in keep operations when fy is increased by one.
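Both closed forms are linear in fy and trivial to evaluate (a sketch; the base and increment constants are assumed to have been obtained by counting operations once for the base size fx × fy0):

    /* Number of prefetch and keep operations for initial data as a
       function of fy, per the formulas above. */
    int prefetch_ops_ini(int fy, int fy0, int pre_base, int pre_incr)
    {
        return pre_base + (fy - fy0) * pre_incr;
    }

    int keep_ops_ini(int fy, int fy0, int keep_base, int keep_incr)
    {
        return keep_base + (fy - fy0) * keep_incr;
    }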

In order to understand what makes a good partition size, we first need the definition of a balanced overall schedule, which also states the balanced overall schedule requirement.

Definition 2. A balanced overall schedule is a schedule in which the memory schedule is at most one keep-operation time unit longer than the ALU schedule.

To reduce the computation complexity and simplify the analysis, we add a restriction on the partition size: the partition must be large enough that no data dependence spans more than two partitions.

(1) No delay dependency spans more than two partitions along the y coordinate direction; that is, fy(Py · y) ≥ dy for all d = (dx, dy) ∈ D.

(2) No delay dependency spans more than two partitions along the x coordinate direction; that is, fx > max{dx − dy(Py · y / (Py · x))}.

As long as these constraints on the minimal partition size are satisfied, the length of the prefetch and keep parts for intermediate data in the memory schedule grows more slowly than the ALU schedule length when the partition size is enlarged. If, at this point, no partition size can be found that meets the balanced overall schedule requirement, it means that the length of the block prefetch part for the initial data grows too fast. Due to the nature of block prefetch, increasing fx increases the number of block prefetches only by a small number, while increasing the ALU part by a relatively large length. Therefore, a partition size which satisfies the balanced overall schedule requirement can always be found. Algorithm 2 determines the partition size that yields a balanced overall schedule.

After the optimal partition size is determined, the operations in the ALU and memory schedules can easily be arranged. The ALU part is the duplication of the schedule for one iteration. In the memory part, the memory operations for initial data are allocated first, followed by the memory operations for intermediate data, as discussed above.

The memory requirement for a partition consists of four parts: the memory for the calculation of in-partition data, the memory for prefetch operations on intermediate data, the memory for keep operations on intermediate data, and the memory for the operations on initial data. The memory consumption for in-partition data can be computed as in [9].


Table 1: Experimental results with only one initial data reference.

Benchmark | Px | Py | New algo (size / m r / len) | Partition algo (size / m r / len / ratio) | List (len / ratio) | Hardware (len / ratio)
WDF   | (1, 0) | (−3, 1) | 4×7 / 221 / 4.107  | 4×4 / 143 / 5.312 / 22.68% | 18 / 77.18% | 10 / 58.93%
IIR   | (1, 0) | (−2, 1) | 4×9 / 407 / 6.028  | 4×7 / 350 / 6.893 / 12.55% | 36 / 83.26% | 37 / 83.71%
DPCM  | (1, 0) | (−2, 1) | 8×10 / 736 / 4.01  | 8×8 / 628 / 4.891 / 18.01% | 25 / 83.96% | 21 / 80.9%
2D    | (1, 0) | (0, 1)  | 3×5 / 233 / 12     | 3×4 / 207 / 12 / 0.0%      | 55 / 78.18% | 51 / 76.47%
Floyd | (1, 0) | (−3, 1) | 7×5 / 301 / 6.057  | 4×4 / 174 / 6.312 / 4.04%  | 32 / 81.72% | 30 / 79.81%

Input: the ALU schedule for one iteration, the partition shape Px × Py, and the initial data offset vector set R.
Output: a partition size which generates a balanced overall schedule.

(1) Based on the information on initial data, use Algorithm 1 to calculate the minimum partition size fx′ and fy′.
(2) Using the two conditions on partition size above, calculate another pair of minima fx′′ and fy′′.
(3) Form the pair fx = max(fx′, fx′′) and fy = max(fy′, fy′′).
(4) Using this pair (fx, fy), calculate the number of prefetch operations, block prefetch operations, and keep operations.
(5) Calculate the ALU schedule length and check whether the balanced overall schedule requirement is satisfied.
(6) If it is satisfied, this pair (fx, fy) is the partition size. Otherwise, increase fx by one and use the balanced overall schedule requirement to find the minimum fy; if no such fy exists, continue increasing fx until a feasible fy is found. Use these values as the partition size.
(7) Based on the partition size, output the corresponding ALU part schedule and memory part schedule.

Algorithm 2: Finding a balanced overall schedule.
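The search in step (6) can be sketched as a nested loop; alu_len() and mem_len() are assumed helpers standing in for the schedule-length computations of this section, and the argument above guarantees that the search terminates:

    /* Grow fx and, for each fx, look for the smallest fy for which the
       memory schedule is at most one keep-operation time unit longer
       than the ALU schedule (Definition 2, with keep = 1 cycle). */
    extern int alu_len(int fx, int fy);   /* ALU schedule length (cycles)    */
    extern int mem_len(int fx, int fy);   /* memory schedule length (cycles) */

    void balanced_size(int fx0, int fy0, int fy_max, int *fx, int *fy)
    {
        int x, y;
        for (x = fx0; ; x++)              /* terminates per the text above */
            for (y = fy0; y <= fy_max; y++)
                if (mem_len(x, y) <= alu_len(x, y) + 1) {
                    *fx = x;
                    *fy = y;
                    return;
                }
    }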

The other parts can be computed simply by multiplying the number of operations by the memory requirement of each operation. The memory requirement of a prefetch operation is 2 locations: one stores the data prefetched during the previous partition and consumed in the current partition, the other stores the data prefetched during the current partition and consumed in the next partition. By the same rule, a keep operation also takes 2 memory locations. A block prefetch operation takes 2 × block size memory locations.
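Summing the four parts gives the total (a direct transcription of the rule above; mem_in_partition stands for the in-partition term computed as in [9]):

    /* Memory requirement of one partition: prefetch and keep double-
       buffer one datum each; a block prefetch double-buffers a block. */
    int partition_memory(int mem_in_partition, int n_prefetch,
                         int n_keep, int n_block_prefetch, int block_size)
    {
        return mem_in_partition
             + 2 * n_prefetch
             + 2 * n_keep
             + 2 * n_block_prefetch * block_size;
    }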

5. EXPERIMENT

In this section, we use several DSP benchmarks to illustrate the effectiveness of our new algorithm: WDF, IIR, DPCM, 2D, and Floyd, as indicated in Tables 1 and 2, which stand for the wave digital filter, infinite impulse response filter, differential pulse-code modulation device, two-dimensional filter, and Floyd-Steinberg algorithm, respectively.

These are DSP filters in common use in real DSP applications. We applied five different algorithms to these benchmarks: list scheduling, a hardware prefetching scheme, the partitioning algorithms in [9, 13], and our new partitioning algorithm (since it has been shown in [9] that the loop tiling technique cannot outperform the partitioning algorithms, we do not compare against loop tiling in this section). In list scheduling, the same architecture model is used; however, the ALU part uses the traditional list scheduling algorithm and the iteration space is not partitioned. For hardware prefetching scheduling, we use the model presented in [18], in which, whenever a block is accessed, the next block is also loaded. The partitioning algorithms in [9, 13] assume the same architecture model as ours; they partition the iteration space and execute the entire loop along the partition sequence, but they do not take into account the influence of the initial data.

In the experiment, we assume an ALU computation and a keep operation of one clock cycle each, a prefetch time of 10 CPU clock cycles, and a block prefetch time of 16 CPU clock cycles, which is reasonable given the large performance gap between the CPU and main memory. Table 1 presents results with only one initial data reference, with offset vector (1, 1), and Table 2 presents results with three initial data references, with offset vector set {(1, 1), (2, −2), (0, 3)}. Note that all three initial data references are uniformly generated. From the discussion in Section 4, the overall footprint is simply the sum of the footprints with respect to the different uniformly generated reference sets. In Tables 1 and 2, the par vector column determines the partition shape. The list column gives the schedule length for list scheduling and the improvement ratio of our algorithm over list scheduling. The hardware column gives the schedule length for hardware prefetching and our algorithm's relative improvement. Since the algorithm in [13] gives the same result as the algorithm in [9] when there is no memory size constraint, we merge their results into one column, partition algo. In the partition algo and new algo columns, the size column is the partition size expressed in multiples of the partition vectors, the m r column is the corresponding memory requirement, and the len column is the average schedule length of the corresponding algorithm. The ratio column is the improvement our new algorithm obtains relative to the corresponding algorithm.

List scheduling and hardware prefetching schedule the operations iteration by iteration, which results in a much longer memory schedule.


Table 2: Experimental results with three initial data references.

Benchmark | Px | Py | New algo (size / m r / len) | Partition algo (size / m r / len / ratio) | List (len / ratio) | Hardware (len / ratio)
WDF   | (1, 0) | (−3, 1) | 8×7 / 474 / 4.018   | 4×4 / 206 / 8 / 49.78%      | 22 / 81.74% | 10 / 58.92%
IIR   | (1, 0) | (−2, 1) | 5×13 / 772 / 6.015  | 4×7 / 472 / 7.857 / 23.44%  | 40 / 84.96% | 37 / 83.74%
DPCM  | (1, 0) | (−2, 1) | 8×14 / 1207 / 4.001 | 8×8 / 811 / 5.266 / 24.02%  | 29 / 86.2%  | 21 / 80.95%
2D    | (1, 0) | (0, 1)  | 4×5 / 346 / 12      | 3×4 / 253 / 13.833 / 13.25% | 59 / 79.66% | 51 / 76.47%
Floyd | (1, 0) | (−3, 1) | 8×6 / 526 / 6       | 4×4 / 223 / 8.812 / 31.91%  | 36 / 83.33% | 30 / 80%

It is this dominant memory schedule that leads to an overall schedule far from balanced, so many ALU resources are wasted waiting for data. Their much worse performance compared with the partitioning techniques can be seen in the tables.

Although the traditional partitioning algorithms consider the balance of the ALU and memory schedules for intermediate data, they lack consideration of the initial data. The time consumed loading the initial data is a rather significant factor for one partition, and neglecting it results in an unbalanced overall schedule in which the memory latency cannot be efficiently hidden. This is why the traditional partitioning algorithms perform worse than our new algorithm; it also explains why their performance degrades as the number of initial data references increases. Our new algorithm considers both data locality and the initial data; therefore, much better performance is achieved by balancing the ALU and memory schedules.

6. CONCLUSION

In this paper, a new scheme was proposed that obtains a minimal average schedule length under consideration of the initial data. The theory and an algorithm on initial data were presented. The algorithm exploits the instruction-level parallelism among instructions by using software pipelining techniques and combines it with data prefetching to produce high-throughput schedules. Experiments on DSP benchmarks show that our scheme consistently produces a better average schedule length than existing methods.

REFERENCES

[1] T. Mowry, "Tolerating latency in multiprocessors through compiler-inserted prefetching," ACM Trans. Computer Systems, vol. 16, no. 1, pp. 55–92, 1998.

[2] T.-F. Chen, Data Prefetching for High-Performance Processors, Ph.D. thesis, Dept. of Computer Science and Engineering, University of Washington, Wash, USA.

[3] F. Dahlgren and M. Dubois, "Sequential hardware prefetching in shared-memory multiprocessors," IEEE Trans. on Parallel and Distributed Systems, vol. 6, no. 7, pp. 733–746, 1995.

[4] N. Manjikian, "Combining loop fusion with prefetching on shared-memory multiprocessors," in Proc. International Conference on Parallel Processing, pp. 78–82, Bloomingdale, Ill, USA, August 1997.

[5] M. K. Tcheun, H. Yoon, and S. R. Maeng, "An adaptive sequential prefetching scheme in shared-memory multiprocessors," in Proc. International Conference on Parallel Processing, pp. 306–313, Bloomington, Ill, USA, August 1997.

[6] N. Passos and E. H.-M. Sha, "Scheduling of uniform multidimensional systems under resource constraints," IEEE Trans. on VLSI Systems, vol. 6, no. 4, pp. 719–730, 1998.

[7] W. Mangione-Smith, S. G. Abraham, and E. S. Davidson, "Register requirements of pipelined processors," in Proc. International Conference on Supercomputing, pp. 260–271, Washington, DC, USA, July 1992.

[8] B. R. Rau, "Iterative modulo scheduling: an algorithm for software pipelining loops," in Proc. 27th Annual International Symposium on Microarchitecture, pp. 63–74, San Jose, Calif, USA, November 1994.

[9] Z. Wang, T. W. O'Neil, and E. H.-M. Sha, "Minimizing average schedule length under memory constraints by optimal partitioning and prefetching," Journal of VLSI Signal Processing, vol. 27, no. 3, pp. 215–233, 2001.

[10] P. Boulet, A. Darte, T. Risset, and Y. Robert, "(Pen)-ultimate tiling," in Scalable High-Performance Computing Conference, pp. 568–576, Knoxville, Tenn, USA, May 1994.

[11] J. Chame and S. Moon, "A tile selection algorithm for data locality and cache interference," in Proc. 13th ACM International Conference on Supercomputing, pp. 492–499, Rhodes, Greece, June 1999.

[12] A. Agarwal, D. A. Kranz, and V. Natarajan, "Automatic partitioning of parallel loops and data arrays for distributed shared-memory multiprocessors," IEEE Trans. on Parallel and Distributed Systems, vol. 6, no. 9, pp. 943–962, 1995.

[13] F. Chen and E. H.-M. Sha, "Loop scheduling and partitions for hiding memory latencies," in Proc. IEEE 12th International Symposium on System Synthesis, pp. 64–70, San Jose, Calif, USA, November 1999.

[14] V. Van Dongen and P. Quinton, "Uniformization of linear recurrence equations: a step towards the automatic synthesis of systolic arrays," in International Conference on Systolic Arrays, pp. 473–482, San Diego, Calif, USA, May 1988.

[15] R. Bixby, K. Kennedy, and U. Kremer, "Automatic data layout using 0-1 integer programming," in Proc. International Conference on Parallel Architectures and Compilation Techniques, pp. 111–122, Montreal, Canada, August 1994.

[16] G. Rivera and C. W. Tseng, "Eliminating conflict misses for high performance architectures," in Proc. 1998 ACM International Conference on Supercomputing, pp. 353–360, Melbourne, Australia, July 1998.

[17] N. L. Passos, E. H.-M. Sha, and S. C. Bass, "Schedule-based multi-dimensional retiming on data flow graphs," IEEE Trans. Signal Processing, vol. 44, no. 1, pp. 150–156, 1996.

[18] J. L. Baer and T. F. Chen, "An effective on-chip preloading scheme to reduce data access penalty," in Proc. Supercomputing '91, pp. 176–186, Albuquerque, NM, USA, November 1991.


Zhong Wang received a Bachelor's degree in electrical engineering in 1994 from Xi'an Jiaotong University, China, and a Master's degree in information and signal processing in 1998 from the Institute of Acoustics, Academia Sinica, China. Currently, he is pursuing his Ph.D. in computer science and engineering at the University of Notre Dame in Indiana. His current research focuses on loop scheduling and high-level synthesis.

Edwin Hsing-Mean Sha received his B.S. degree in computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 1986; he received the M.S. and Ph.D. degrees from the Department of Computer Science, Princeton University, Princeton, NJ, in 1991 and 1992, respectively. From August 1992 to August 2000, he was with the Department of Computer Science and Engineering at the University of Notre Dame, Notre Dame, IN, where he served as Associate Chairman for Graduate Studies from 1995. He is now a tenured full professor in the Department of Computer Science at the University of Texas at Dallas. He has published more than 140 research papers in refereed conferences and journals. He has been serving as an editor for several journals, such as IEEE Transactions on Signal Processing and Journal of VLSI Signal Processing, and has served as a program committee member in numerous conferences. He received the Oak Ridge Association Junior Faculty Enhancement Award in 1994 and the NSF CAREER Award. He was a guest editor for the special issue on Low Power Design of IEEE Transactions on VLSI Systems in 1997, and served as program chair for the International Conference on Parallel and Distributed Computing Systems (PDCS) 2000 and PDCS 2001. He received a Teaching Award in 1998.

Yuke Wang received his B.S. degree from the University of Science and Technology of China, Hefei, China, in 1989, and the M.S. and Ph.D. degrees from the University of Saskatchewan, Canada, in 1992 and 1996, respectively. He has held faculty positions at Concordia University, Canada, and Florida Atlantic University, Florida, USA. Currently he is an Assistant Professor in the Computer Science Department, University of Texas at Dallas. He has also held visiting assistant professor positions at the University of Minnesota, the University of Maryland, and the University of California at Berkeley. Dr. Wang is currently an Editor of IEEE Transactions on Circuits and Systems, Part II, an Editor of IEEE Transactions on VLSI Systems, an Editor of EURASIP Journal on Applied Signal Processing, and a few other journals. His research interests include VLSI design of circuits and systems for DSP and communication, computer-aided design, and computer architectures. During 1996–2001, he published about 60 papers, among which about 20 papers are in IEEE/ACM Transactions.


EURASIP Journal on Applied Signal Processing 2002:9, 936–943
© 2002 Hindawi Publishing Corporation

P-CORDIC: A Precomputation Based Rotation CORDIC Algorithm

Martin Kuhlmann
Broadcom Corporation, Irvine, CA 92619, USA
Email: [email protected]

Keshab K. Parhi
Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA
Email: [email protected]

Received 30 August 2001 and in revised form 14 May 2002

This paper presents a CORDIC (coordinate rotation digital computer) algorithm and architecture for the rotation mode in which the directions of all micro-rotations are precomputed while maintaining a constant scale factor. Thus, an examination of the sign of the angle after each iteration is no longer required. The algorithm is capable of performing the CORDIC computation for an operand word-length of 54 bits. Additionally, there is a higher degree of freedom in choosing the pipeline cutsets due to the novel feature of independence of the iterations i and i − 1 in the CORDIC rotation.

Keywords and phrases: CORDIC, computer arithmetic, constant scale factor, precomputation, rotation mode.

1. INTRODUCTION

CORDIC (coordinate rotation digital computer) [1, 2] is an iterative algorithm for the calculation of the rotation of a 2-dimensional vector, in linear, circular, or hyperbolic coordinate systems, using only add and shift operations. It has a wide range of applications including discrete transformations such as the Hartley transform [3], discrete cosine transform [4], fast Fourier transform (FFT) [5], and chirp Z transform (CZT) [6], solving eigenvalue and singular value problems [7], digital filters [8], Toeplitz system and linear system solvers [9], and Kalman filters [10]. It can also be used for multiuser detection in code division multiple access (CDMA) wireless systems [11].

The CORDIC algorithm consists of two operating modes, the rotation mode and the vectoring mode. In the rotation mode, a vector (x, y) is rotated by an angle θ to obtain the new vector (x*, y*) (see Figure 1). In every micro-rotation i, fixed angles of the value arctan(2^{−i}) are subtracted or added from/to the angle remainder θ_i, so that the angle remainder approaches zero. In the vectoring mode, the length R and the angle α towards the x-axis of a vector (x, y) are computed. For this purpose, the vector is rotated towards the x-axis so that the y-component approaches zero. The sum of all angle rotations is equal to the value of α, while the value of the x-component corresponds to the length R of the vector (x, y).


Figure 1: The rotation and vectoring modes of the CORDIC algorithm.

The mathematical relations for the CORDIC rotations are as follows:

x_{i+1} = x_i + m · σ_i · 2^{−i} · y_i,
y_{i+1} = y_i − σ_i · 2^{−i} · x_i,
z_{i+1} = z_i − (1/√m) · σ_i · arctan(√m · 2^{−i}),   (1)

where σ_i is the weight of each micro-rotation and m steers the choice of rectangular (m = 0), circular (m = 1), or hyperbolic (m = −1) coordinate systems. The required micro-rotations are not perfect rotations; they increase the length of the vector. In order to maintain a constant vector length, the obtained results have to be scaled by a scale factor K. Nevertheless, assuming consecutive rotations in positive and/or negative directions, the scale factor is constant and can be precomputed according to

K = ∏_{i=0}^{n−1} k_i = ∏_{i=0}^{n−1} (1 + σ_i^2 · 2^{−2i})^{1/2}.   (2)

The computation of the scale factor can be truncated after n/2 iterations because the multiplicands in the last n/2 iterations are 1 due to the finite word-length and do not affect the final value of K_1,

K_1 = ∏_{i=0}^{n/2} (1 + σ_i^2 · 2^{−2i})^{1/2}.   (3)
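
The following minimal Python sketch (our illustration, not code from the paper) implements the conventional rotation mode for the circular case (m = 1) with the standard sign convention, including the per-iteration sign test of the angle remainder that the precomputation scheme of this paper eliminates, and the constant scale factor K of (2):

    import math

    def cordic_rotate(x, y, theta, n=16):
        # Conventional rotation-mode CORDIC, circular case (m = 1).
        # sigma_i is taken from the sign of the angle remainder z_i --
        # exactly the per-iteration test that P-CORDIC removes.
        z = theta
        for i in range(n):
            sigma = 1 if z >= 0 else -1
            x, y = x - sigma * y * 2**-i, y + sigma * x * 2**-i
            z -= sigma * math.atan(2**-i)
        # Constant scale factor K of (2), with sigma_i^2 = 1.
        K = math.prod(math.sqrt(1 + 2**(-2 * i)) for i in range(n))
        return x / K, y / K

    # Rotating (1, 0) by 30 degrees yields approximately (cos 30, sin 30).
    print(cordic_rotate(1.0, 0.0, math.radians(30)))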

There are two different approaches for the computation of the CORDIC algorithm. The first one uses consecutive rotations in positive and/or negative direction, where the weight of each rotation is 1. Hence, σ_i is either −1 or 1, depending on the sign of the angle remainder z(i). In every iteration, a significant amount of time is used to examine the most significant bit in the case of a binary architecture, or the most significant three digits in the case of a redundant architecture, to predict the sign of z(i) and hence the rotation direction σ_i. In comparison to the CORDIC implementations with constant scale factor, other implementations use a minimally redundant radix-4 or an even higher radix number representation [12, 13, 14]. These architectures make use of a wider range of σ_i. In the case of a minimally redundant radix-4 architecture, σ_i ∈ {−2, −1, 0, 1, 2}. By using this number system, the number of iterations can be reduced. However, the computation time per iteration increases, since it takes more time to differentiate between five different rotation direction values and to generate five different multiples of arctan(2^{−i}). The scale factor also becomes variable and has to be computed every time, due to the absence of consecutive rotations, leading to an increase in area.

To speed up the computation of the CORDIC algorithm, either the number of iterations or the delay of each iteration has to be minimized. The proposed algorithm introduces a novel approach, in which the rotation directions can be precomputed by adding to the rotation angle θ a constant and a variable adjustment which is stored in a table. Hence, a significant speedup of the delay per iteration is obtained. Since all rotation directions are known before the actual rotation begins, more than one rotation can also be performed in one iteration, leading to a reduction in latency. The proposed architecture also eliminates the z-datapath and reduces the area of the implementation.

This paper is organized as follows. Section 2 presents the theoretical background for the novel CORDIC algorithm for the rotation mode and Section 3 presents the novel architecture. Section 4 performs an evaluation of different CORDIC architectures, while Section 5 concludes the paper.

2. THE NOVEL CORDIC ALGORITHM

2.1. Mathematical derivation using Taylor series

The summation of all micro-rotations with their corresponding weights σ_i is equivalent to the rotation angle θ:

θ = Σ_{i=0}^{n} σ_i · arctan(2^{−i}),   (4)

where σ_i ∈ {−1, 1}, corresponding to the addition and subtraction of the micro-angles θ_i. Since consecutive rotations are employed, the scale factor is constant. The value of σ can be interpreted as a number in radix-2 representation. The goal of the proposed method is to compute the sequence of the micro-rotations without performing any iteration. To accomplish this, σ_i is recoded as 2d_i − 1, leading to a binary representation in which a zero corresponds to the addition of a micro-angle [15, 16]. This allows the use of simple binary adders. Adding and subtracting 2^{−i} in (4) results in

θ = Σ_{i=0}^{∞} (2d_i − 1) · (2^{−i} − 2^{−i} + arctan(2^{−i}))   (5)
  = Σ_{i=0}^{∞} (2d_i − 1) · 2^{−i} − Σ_{i=0}^{∞} (2d_i − 1) · (2^{−i} − arctan(2^{−i}))   (6)
  = 2d − 2 + Σ_{i=0}^{∞} (2^{−i} − arctan(2^{−i})) − Σ_{i=0}^{∞} 2d_i (2^{−i} − arctan(2^{−i}))   (7)
  = 2d − c_1 − sign(θ) · 2(1 − arctan(1)) − Σ_{i=1}^{∞} 2d_i (2^{−i} − arctan(2^{−i})),   (8)

where c_1 corresponds to c_1 = 2 − Σ_{i=0}^{∞} (2^{−i} − arctan(2^{−i})).

Solving (8) for d results in

d = 0.5θ + 0.5c_1 + sign(θ) · (1 − arctan(1)) − Σ_{i=1}^{∞} d_i · (2^{−i} − arctan(2^{−i}))
  = 0.5θ + c + sign(θ) · ε_0 − Σ_{i=1}^{∞} d_i · ε_i,   (9)

where c corresponds to 0.5c_1. Table 1 shows the values of the partial offsets ε_i for the first 10 values of i and indicates that the value of ε_i decreases approximately by a factor of 8 with increasing i. Hence, the summation of d_i ε_i can be limited to

d = 0.5θ + c − sign(θ) · ε_0 − Σ_{i=1}^{n/3} d_i · ε_i,
d = 0.5θ + c − sign(θ) · ε_0 − δ.   (10)


Table 1: The values of ε_i for the first 10 values of i.

Iteration i    Partial offset ε_i
0              0.214601836602551690
1              3.635239099919176e-02
2              5.021336873135844e-03
3              6.450054532385649e-04
4              8.119000404265152e-05
5              1.016656973172375e-05
6              1.271379523169197e-06
7              1.589398988887035e-07
8              1.986803302817237e-08
9              2.483521181314878e-09
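
As a quick numeric check (our sketch, not part of the paper), the entries of Table 1 follow directly from ε_i = 2^{−i} − arctan(2^{−i}), and the printed ratios confirm the roughly factor-of-8 decay that justifies truncating the correction sum at i = n/3:

    import math

    # Reproduces Table 1 and the approximate factor-of-8 decay of eps_i.
    eps = [2**-i - math.atan(2**-i) for i in range(10)]
    for i, e in enumerate(eps):
        ratio = eps[i - 1] / e if i > 0 else float("nan")
        print(f"{i:2d}  {e:.15e}  ratio: {ratio:4.1f}")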

Rather than storing the partial offsets ε_i and computing the sum over all i of the products d_i ε_i, δ = Σ_{i=1}^{n/3} d_i ε_i can be precomputed and stored. Hence, the only difficulty consists of determining which offset corresponds to the input θ. This can be achieved by comparing the input θ with a reference angle θ_ref. The reference angles θ_ref correspond to the summation of the first n/3 micro-rotations. To be certain to obtain the correct offset, θ has to be larger than the reference angle θ_ref. All reference angles are stored in a ROM and are accessed by the most significant n/3 bits of θ. In addition to the reference angles, the values of δ are stored. In case of a negative difference θ_ref − θ, the corresponding δ is selected; otherwise, the next smaller value of δ is chosen to be subtracted from θ + c − sign(θ) · ε_0.

Example 1. Assume a word-length of 16 bits and θ = 0.9773844. According to Table 2, θ_ref corresponds to 0.97337076 and δ = 0.03644375. Hence, d is computed as

d = 0.5 · θ + 1 − 0.5 · c + Σ_{i=0}^{n/3} d_i ε_i
  = 0.5 · 0.9773844 + 1.08624513 + 0.03644375
  = 1.6113811 = 1.1001110010000011_2,
σ = + + − − + + + − − + − − − − − + +  (σ_i = 2d_i − 1).   (11)
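
The arithmetic of Example 1 can be replayed in a few lines (our sketch using the printed constants, where 1.08624513 stands for the term 1 − 0.5 · c):

    theta = 0.9773844
    const = 1.08624513      # printed value of the term 1 - 0.5*c
    delta = 0.03644375      # offset from Table 2 for theta_ref = 0.97337076
    d = 0.5 * theta + const + delta
    print(d)                # 1.61138108, matching 1.6113811

    # Fractional bits of d, MSB first; each bit d_i gives sigma_i = 2*d_i - 1.
    frac, bits = d - 1.0, []
    for _ in range(16):
        frac *= 2
        bits.append(int(frac))
        frac -= bits[-1]
    print("d = 1." + "".join(map(str, bits)) + " (binary)")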

2.2. High precision

By using a mantissa of n = 54 bits (corresponding to the floating point precision), the ROM for storing all offsets would require 2^18 entries. This is rather impractical since the required area to implement the ROM would exceed by far the area of the CORDIC implementation. To reduce the area of the ROM, δ can be split into two parts,

δ = δ_ROM + δ_r,   (12)

where δ_ROM is stored in a ROM while δ_r is computed. By examining the Taylor series expansion of arctan(2^{−i}), it becomes obvious that the partial offsets ε for iterations i and i + 1

Table 2: The reference angles of the rotation mode and their corresponding values of δ for an operand word-length of 16 bits.

θ_ref          δ
−0.01640412    0.00008119
0.04607554     0.00009136
0.10746824     0.00064501
0.16994791     0.00065517
0.23230586     0.00072620
0.29478553     0.00073636
0.34871558     0.00502134
0.41119525     0.00503150
0.47355320     0.00510253
0.53603287     0.00511269
0.59742557     0.00566634
0.65990524     0.00567651
0.72226319     0.00574753
0.78474286     0.00575770
0.78605347     0.03635239
0.84853314     0.03636256
0.91089109     0.03643358
0.97337076     0.03644375
1.03476346     0.03699740
1.09724313     0.03700756
1.15960108     0.03707859
1.22208075     0.03708875
1.27601080     0.04137373
1.33849046     0.04138389
1.40084842     0.04145492
1.46332808     0.04146508
1.52472079     0.04201873
1.58720045     0.04202890

correspond to

ε_i = 2^{−i} − arctan(2^{−i})
    = 2^{−3i}/3 − 2^{−5i}/5 + 2^{−7i}/7 − 2^{−9i}/9 + · · · ,   (13)

ε_{i+1} = 2^{−i−1} − arctan(2^{−i−1})   (14)
        = 2^{−3i−3}/3 − 2^{−5i−5}/5 + 2^{−7i−7}/7 − 2^{−9i−9}/9 + · · ·   (15)
        = 2^{−3} (2^{−3i}/3 − 2^{−5i−2}/5 + 2^{−7i−4}/7 − 2^{−9i−6}/9 + · · ·).   (16)

By comparing (13) and (16), it can be seen that (13) is about 2^3 times larger than (16). Assuming a word-length of n bits and i > ⌈n/5⌉ − 2, the factor is exactly 2^3. Hence, the term ε_{⌈n/5⌉−1} = 2^{−3(⌈n/5⌉−1)}/3 − 2^{−5(⌈n/5⌉−1)}/5 can be stored in a ROM and


the remaining offset δr is computed as

δ_r = Σ_{j=⌈n/5⌉−1}^{⌈n/3⌉} d_j · ε_{⌈n/5⌉−1} · 2^{−3(j−⌈n/5⌉+1)} < 2^{−3(⌈n/5⌉−1)}.   (17)

The largest magnitude of δ_r is smaller than 2^{−3(⌈n/5⌉−1)}.
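
A compact expression of the correction (17) (our sketch; it assumes ceiling brackets, approximates the single stored constant by its exact value, and d_bits is a hypothetical 0/1 list of the already determined direction bits):

    import math

    def delta_r(d_bits, n):
        # High-precision correction of (17): each tail offset is the stored
        # eps_{ceil(n/5)-1} scaled down by 2^-3 per index step.
        lo = math.ceil(n / 5) - 1
        hi = math.ceil(n / 3)
        eps_ref = 2**-lo - math.atan(2**-lo)   # the single stored constant
        return sum(d_bits[j] * eps_ref * 2**(-3 * (j - lo))
                   for j in range(lo, hi + 1))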

Example for high precision

Assume a word-length of 50 bits and θ = 0.977384381116. Using the most significant 9 bits of θ, δ_ROM = 0.03644501895249 can be obtained. Hence, d is computed according to

d = 0.5 · θ + 1 − 0.5 · c + δ_ROM + δ_r
  = 0.5 · 0.97738438111600 + 1.08624514683872 + 0.03644501895249
    + 2.483521181241566e−09 · (1 + 2^{−18} + 2^{−21} + 2^{−24})
  = 1.61138235883274
  = 1.1001110010000011100011011110010010001001101110001_2.   (18)

2.3. The rotation mode in hyperbolic coordinate systems

Similar to the circular coordinate system, a simple correlation between the input angle θ and the directions of the micro-rotations can be obtained. Due to the incomplete representation of the hyperbolic rotation angles θ_i, some iterations have to be performed twice. In [2], it was recommended that every 4th, 13th, . . . , (3k + 1)th iteration should be repeated to complete the angle representation.

Similar to the rotation mode in the circular coordinate system, the rotation angle θ is equivalent to the summation of all micro-rotations with their corresponding weights. This leads to

θ = Σ_{i=0}^{∞} σ_i · arctanh(2^{−i}) + σ_4^extra · arctanh(2^{−4}) + σ_13^extra · arctanh(2^{−13}) + σ_40^extra · arctanh(2^{−40}) + · · · .   (19)

Performing a Taylor series expansion and applying σ_i = 2d_i − 1 results in

d = 0.5θ + 0.5c_1 + sign(θ) · 2(0.5 − arctanh(0.5))
    + (d_4^extra − 1/2) · arctanh(2^{−4})
    + (d_13^extra − 1/2) · arctanh(2^{−13})
    + (d_40^extra − 1/2) · arctanh(2^{−40})
  = 0.5θ + c + d_4^extra · arctanh(2^{−4}) + d_13^extra · arctanh(2^{−13})
    + d_40^extra · arctanh(2^{−40}),   (20)

where c corresponds to c = 1 − 0.5 Σ_{i=1}^{∞} (2^{−i} − arctanh(2^{−i})) − 0.5 · (arctanh(2^{−4}) + arctanh(2^{−13}) + arctanh(2^{−40}) + · · ·). Since these extra rotations are not known in advance, an efficient high-precision VLSI implementation is not possible. However, for signal processing applications using a word-length of less than 13 bits, the ROM size corresponds to only 14 entries.

3. THE NOVEL ROTATION-CORDIC ARCHITECTURE

For an implementation with an operand word-length of n bits, the pre-processing part consists of a ROM of 2^{⌈n/5⌉−2} entries in which the reference angles θ_ref and the corresponding offsets δ are stored (see Figure 2). To avoid a second access to the ROM in case of θ_ref > θ, the next smaller offset δ_{k−1} is additionally stored in the kth entry of the ROM. The ROM is accessed by the ⌈n/5⌉ − 2 MSB bits of θ. A binary tree adder computes whether θ is smaller or larger than the chosen reference angle θ_ref and selects the corresponding offset (either δ_k or δ_{k−1}). Using a 3 : 2 compressor and another fast binary tree adder, the two required additions to obtain d_approx = 0.5θ + c_2 + δ_ROM can be performed, where c_2 corresponds to c + sign(θ)ε_0. Using the bits d_{⌈n/5⌉−1} to d_{⌈n/3⌉}, δ_r can be computed according to (17) and has to be added to d_approx. In the worst case, there is a possible ripple from bit d_{3(⌈n/5⌉−1)} to bit d_{⌈n/5⌉}, which would call for a time-consuming ripple adder. However, by employing an extra rotation for d_{3(⌈n/5⌉−1)−1}, this limitation can be resolved. This extra rotation corresponds to the overflow bit of the addition of the bits d^approx_{3(⌈n/5⌉−1)···n} and δ_r. The additional rotation also does not affect the scale factor, since 3(⌈n/5⌉ − 1) > n/2. For a precision of n ≤ 16 bits, there are fewer than 32 offsets, which can be stored in a ROM, and the additional overhead to compute δ_r can be removed.

An alternative architecture can be chosen by realizing that the directions of the micro-rotations are required in a most-significant-bit-first manner (see Figure 2). As in the previous architecture, a fast binary adder is employed to determine which offset has to be selected. A redundant signed-digit adder adds 0.5θ, c, and δ_ROM, and an on-the-fly converter starts converting the result into the corresponding binary representation. Normally, the most significant bit cannot be determined until the least significant digit is converted. However, such worst cases do not exist in the CORDIC implementation, due to the redundant representations of the angles arctan(2^{−i}), where

arctan(2^{−i}) < Σ_{k=i+1}^{∞} arctan(2^{−k}),   (21)

as opposed to the binary representation

2^{−i} > Σ_{k=i+1}^{∞} 2^{−k}.   (22)

Therefore, it is not possible that there are more than l − 1 consecutive rotations in the same direction. In case there are l − 1 consecutive rotations in the same direction, the lth



Figure 2: The novel architecture for the rotation mode.

Table 3: The maximal number of consecutive rotations in the same direction.

i    θ_i      l − 1
0    45       3
1    26.57    5
2    14.04    6
3    7.13     8
4    3.58     10
5    1.79     12

iteration has to be rotated in the opposite direction. This happens if the angle remainder z_i ≈ 0. Table 3 shows the maximum number of consecutive unidirectional rotations depending on the iteration number i. This limitation leads to a reduction in the complexity of the online converters, and its most significant bits can already be used to start the rotations in the x/y datapath.

Example 2. Assume an angle θ = 0.001. The angle remainders θ_i then correspond to

θ_0 = 0.001,
θ_1 = θ_0 − arctan(1) = −0.7844,
θ_2 = θ_1 + arctan(0.5) = −0.3208,
θ_3 = θ_2 + arctan(0.25) = −0.0758,
θ_4 = θ_3 + arctan(0.125) = 0.0486.   (23)

The next rotation has to be performed in the negative direction, since θ_4 > 0. Hence, it is not possible to obtain a rotation sequence like σ_{0···4} = 01111; it has to be σ_{0···4} = 01110.
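
Example 2 can be replayed directly (our sketch; a 0 marks a subtraction, matching the printed sequence):

    import math

    # Replays Example 2 for theta = 0.001: remainders and direction bits.
    z = 0.001
    bits = []
    for i in range(5):
        bits.append(0 if z >= 0 else 1)    # 0: subtract micro-angle, 1: add
        z += (-1 if z >= 0 else 1) * math.atan(2**-i)
        print(f"theta_{i + 1} = {z:+.4f}")
    print("sigma_0..4 =", "".join(map(str, bits)))   # prints 01110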

3.1. Evaluation of the z-datapath

Delay analysis

In this paper, we assume a delay model similar to the one proposed in [14]. However, in [14] the unit delay is set to a gate delay, while in our evaluation the unit delay is set to a full-adder delay. Hence, the delays for a 2-input gate (NAND, NOR), XOR, multiplexer, register, and full-adder are 0.25, 0.5, 0.5, 0.5, and 1 t_FA, respectively.

The determination of which offset has to be chosen consists of the delay of the decoder, the ROM, a fast binary n-bit tree adder, and a multiplexer. Assuming a delay of log_2(m) gate delays for the decoder, where m corresponds to the number of rows in the ROM (m < log_2(n) + 1), one each for the word-line driver and the ROM, log_2(n) · t_Mux for the fast binary adder, and 0.5 · t_FA for the multiplexer, we can obtain the correct value of δ_ROM after a delay of (0.5 log_2(n) + 1 + 0.25 log_2(log_2(n))) · t_FA.

A 3 : 2 compressor can be employed to reduce the number of partial products to two. An additional fast binary tree adder can compute the final value of d_approx. Hence, the entire delay to obtain d_approx corresponds to

(0.5 log_2(n) + 1 + 0.25 log_2(log_2(n)) + 1 + 0.5 log_2(n)) · t_FA = (log_2(n) + 2.25) · t_FA.   (24)

After obtaining the bits d_{⌈n/5⌉−1} to d_{⌈n/3⌉}, δ_r can be computed. Since the value of δ_r is smaller than 2^{−3(⌈n/5⌉−1)} and the value of d_approx + δ_r is not required before 2 · 3⌈n/5⌉ · t_FA, the computation of δ_r is not in the critical path.

As an alternative to the 3 : 2 compressor and the tree adder, a minimally redundant radix-4 signed-digit adder can be employed, which has a delay of two full-adders. Hence, all output digits are available after these two full-adder delays. An additional on-the-fly converter converts the digits into the equivalent binary representation, starting with the MSD. It requires a delay of a multiplexer and four NANDs/NORs to convert one digit, which results in 1.5 t_FA per digit (1 digit = 2 bits). The last digit is converted after a delay of (n/2 + 1) · 1.5 t_FA. As already described in Table 3, bit n/3 is stable as soon as the last digit (corresponding to bit n) has been converted. Hence, the n/3 rotation can be performed after a delay of (n/2 + 1) · 1.5 t_FA. Therefore, iteration i = 0 can already be performed after a delay of (n/2 + 1) · 1.5 t_FA − n/3 · 2 t_FA = (1/12 · n + 1) t_FA. Note that the conversion of one redundant digit is performed faster than the addition/subtraction of the x/y datapath. Hence, an initial delay of (1/12 · n + 1) t_FA + (log_2(n) + 2.25) t_FA = (1/12 · n + log_2(n) + 3.25) t_FA has to be added to the delay of the x/y datapath.

Area analysis

Previously, the area of the z-datapath consisted of n/2 iterations in which (n + log_2 n + 2) multiplexers and (n + log_2 n + 2) full-adders and registers are employed. Additionally, due to the Booth encoding, in the last n/4 iterations about 2(n + log_2 n + 2) multiplexers and (n + log_2 n + 2) full-adders are required. Assuming A_FA = 1.93 · A_mux and A_FA = 1.61 · A_reg (values are based on layouts), the hardware complexity of the z-datapath results in A_z = 1.7 · n(n + log_2 n + 2) A_FA. Assuming a word-length of 54 bits and neglecting the required area for


the examination of the most significant three digits, about 5700 A_FA are required.

The proposed architecture utilizes a ROM of word-length n and 2^{⌈n/5⌉−2} entries, requiring an area of n · 2^{⌈n/5⌉−2} · A_FA · 1/50, resulting in 552 A_FA for a word-length of 54 bits. The implementation of the decoders can be done in multiple ways. NOR-based decoders with precharge lead to the fastest implementation. However, the decoder area becomes larger. The decoder size per word-line corresponds to A_dec = 0.83 A_FA. Since 2^{⌈n/5⌉−2} decoders are required, the area of all decoders corresponds to A_dec,total = 0.83 · 2^{⌈n/5⌉−2} = 424 A_FA, assuming a 54-bit word-length. The ROM has to store θ_ref, δ_k, and δ_{k−1}. This results in a total area for the ROM and the decoder of about 2080 A_FA. The computation of δ_r requires n/3 − n/5 + 2 = 2n/15 + 2 rows of CSAs (carry-save adders) and muxes and a final fast binary tree adder. Note that each row of CSA adders and muxes only consists of (n − 3n/5 + 6 = 2n/5 + 6) bits (the more significant bits are zero). The required areas correspond to 10 · 27 A_FA + 10 · 27 A_mux and 5 · 27 A_FA, respectively. Hence, the computation of δ_r requires 540 A_FA. Moreover, the two redundant signed-digit adders require 2n · A_FA, while the converter consists of about (0.5n^2 + n) A_mux. This corresponds to 108 and 696 A_FA for a word-length of 54 bits. This makes a total of 3426 A_FA, which is about 60% of the z-datapath previously employed.

3.2. Evaluation of the x/y datapath

In the first n/2 micro-rotations, the critical path of the x/y rotator part consists of a multiplexer and a 4 : 2 compressor, which have a combined critical path of 2 full-adders. The last n/2 micro-rotations can be performed using only n/4 iterations, since Booth encoding can be employed. However, the selection of the multiple of the shifted x/y components requires slightly more time, resulting in a delay of about one full-adder delay. The delay of the 4 : 2 compressor remains 1.5 full-adders. Hence, the critical path of the entire x/y rotator part consists of n/2 · 2 t_FA + n/4 · 2.5 · t_FA = 1.625n · t_FA. Note that the direction of the first iteration is already known; hence, the first iteration is not in the critical path. Therefore, the critical path of the entire x/y rotator part consists of (1.625n − 2) t_FA.

As an example, for a word-length of n = 16 bits, the x/y datapath delay and the entire delay of the CORDIC algorithm correspond to 24 and 32.5 full-adder delays, respectively.

3.3. Scale factor compensation

Since the scale factor is constant, the x and y values can already be scaled while the rotation direction is being computed. The scaling requires an adder of word-length (n + log_2(n)) bits. Using a binary tree adder, this results in a delay of log_2(n + log_2(n)) · t_Mux. For the scale factor, a CSD (canonic signed digit) representation can be used, leading to at most n/3 nonzero digits. Applying a Wallace tree for the partial product reduction, the total delay of the scaling results in (0.5 log_2(n + log_2(n)) + log_{1.5}(n/3)) · t_FA < (1/12 · n + log_2(n) + 3.25) · t_FA = t_initial. Hence, the scaling of the x and y coordinates does not affect the total latency of the novel algorithm.

4. OVERVIEW OF PREVIOUSLY REPORTED CORDIC ALGORITHMS

The delay of every iteration can be decomposed into two different time delays, t_{d,σ} and t_{d,xy}, where t_{d,σ} corresponds to the time delay to predict the new rotation direction, while t_{d,xy} corresponds to the time delay of the multiplexer/add structure of the x/y datapath. Various implementations have been proposed to obtain a speedup of the CORDIC algorithm. Improvements have especially been made in the reduction of t_{d,σ}.

In [17], the angle remainder is decomposed every (3k + 1)th iteration. From the given angle θ, the first four rotation directions can be immediately determined. After performing the corresponding additions/subtractions of the terms σ_i · α_i from the input angle θ using CSA arithmetic, a fast binary tree adder computes the nonredundant result z_4. The bits 4 to 13 of z_4 deliver the rotation directions σ_4 to σ_13, which are used to perform the rotations in the x/y datapath and the computation of the next angle remainder z_40. Hence, a low-latency CORDIC algorithm is obtained. However, the significant reduction in latency is achieved at the cost of an irregular design. Furthermore, it is difficult to perform a π/2 initial rotation or the rotation of index i = 0 for circular coordinates, as it would force a conversion from redundant to conventional arithmetic for the z coordinate just after the first micro-rotation, which is costly in time and area. Hence, this parallel and nonpipelined architecture only converges in the range [−1, 1]. The overall latency of this architecture corresponds to about 2n + log_3(n) + log_2(n) full-adder delays.

In [18], a direct correlation between the z remainder after ⌈n/3⌉ rotations and the remaining rotation directions has been shown. Hence, no further examination of the directions of the micro-rotations has to be performed, leading to a considerable reduction in latency. However, in the first ⌈n/3⌉ iterations a conventional method has to be employed.

In [19], the directions of the micro-rotations have been recoded using an offset binary coding (OBC) [20]. The obtained correlation is approximately piecewise linear, since small elementary angles can be approximated by a(i) = arctan(2^{−i}) ≈ s · 2^{n−i−2}, where s is the slope of the linearity. This is valid for i ≥ m, where m is an integer which makes the approximation tolerable (normally m = ⌈n/3⌉). Hence, the following correlation can be obtained:

Σ_{i=m}^{n−1} σ_i · 2α_i ≈ s · Σ_{i=m}^{n−1} σ_i · 2^{−i−1}.   (25)

By performing some arithmetic computations, the following correlation of the rotation directions can be obtained:

Σ_{i=0}^{n−1} σ_i · 2^{−i} = Σ_{i=0}^{n−1} σ_i · 2^{−i−1} − Σ_{i=m}^{n−1} σ_i · 2α(i)/s.   (26)


Hence, a multiplication by the inverse of the slope s is required. This multiplication can be simplified to two stages of addition for an operand word-length of 9 bits. However, in most digital signal processing applications, the operands have a word-length of up to 16 bits. Hence, for those applications, the presented method requires more stages of addition to compensate the multiplication, resulting in a more complex implementation and an increase in delay.

In [21], a double rotation method is introduced which compensates for the scale factor while performing the regular x/y rotations. However, due to the double rotation nature of this method, t_{d,xy} is increased to about twice its original value.

To reduce the latency of the CORDIC operation, [22] proposed an algorithm using online arithmetic. However, this results in a variable scale factor. This drawback is removed in [23]. In every iteration, a significant amount of time is used to examine the most significant three digits to predict σ_i. The employed random logic requires a delay of about 1.5 full-adder delays. Since the x/y datapath consists of a 4 : 2 compressor, it also requires a delay of 2 full-adders. Hence, the overall iteration delay corresponds to 3.5 full-adder delays. To maintain a constant scale factor, consecutive rotations are required in the first n/2 iterations, where n corresponds to the word-length of the operands. For the computation of the last n/2 bits, Booth encoding can be employed, reducing the number of iterations by a factor of 2. However, the selection of the multiple of the shifted x and y operands requires an additional multiplexer delay and increases the overall iteration delay to 4 full-adder delays. Hence, the number of iterations is equivalent to 0.75n, which corresponds to a total latency of 3n full-adders (this does not include the scale operation and the conversion).

Other implementations like [24] remove the extra rotations by a branching mechanism in case the sign of the remainder cannot be determined (most significant three digits are zero). Hence, no extra rotations are required, while the required implementation area is doubled. Nevertheless, the most significant three digits (or most significant six bits) still have to be examined for the prediction of the next rotation direction. In [25], the double step branching CORDIC algorithm is introduced, which performs two rotations in a single step. Nevertheless, this method requires an examination of the most significant six digits to detect two rotation directions. Since some of the digits can be examined in parallel, the delay increases only to 2 t_FA. The computation time of a double rotation in the x/y datapath is slightly reduced compared to two normal x/y rotations. Hence, the total computation time corresponds to 0.5n(2 t_FA + 3 t_FA) = 2.5n · t_FA.

In [26], the signs of all micro-rotations are computed serially. However, a speedup of the sampling rate is achieved by separating the computation of the sign and the magnitude of every z_i or y_i remainder. The sign of every remainder is computed by a pipelined carry-ripple adder (CRA), leading to an initial latency of n full-adders before the first CORDIC rotation can be performed. Nevertheless, after this initial latency, the following signs can be obtained with a delay of only one

Table 4: An overview of the proposed algorithm and other CORDIC implementations.

Approach    Delay in t_FA
proposed    1.625n + 1/12 · n + log_2(n) + 1.25
[14]        2n + 6
[26]        3n + 1
[21]        3.75n
[27]        5.25n
[25]        2.5n
[17]        2n + log_3(n) + log_2(n)

full-adder. This leads to an overall latency of 3n full-adder delays.

In comparison to the CORDIC implementations with constant scale factor, other implementations use a minimally redundant radix-4 or an even higher radix number representation [12, 13, 14]. By using this number system, the number of iterations can be reduced. However, the prediction of σ_i becomes more complicated, since there are more possible values for σ_i. In addition, the scale factor becomes variable and has to be computed every time, due to the absence of consecutive rotations. An online computation of the scale factor and a parallel scaling of the x and y operands can be achieved. Depending on the use of CSAs or fast carry-propagate adders (CCLA), the number of iterations can be reduced to 2⌈n/3⌉ + 4 and n/2 + 1, respectively. The iteration delay t_{d,CSA} of the architecture using CSA adders corresponds to the same delay as already described for the last n/2 iterations of the constant scale factor approach using Booth encoding, while the architecture employing the fast CCLA adders requires 1.5 · t_{d,CSA} [14]. Hence, the overall latency of these CORDIC algorithms using a minimally redundant radix-4 digit set corresponds to about 2n full-adder delays.

Table 4 provides a delay comparison between the proposed algorithm and other CORDIC implementations. Some of the delays have been taken from [14, 17, 26].
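
The formulas of Table 4 are easy to evaluate for a concrete word-length (our sketch; n = 54 matches the high-precision case discussed above):

    import math

    # Evaluates the Table 4 latency formulas, in full-adder delays t_FA.
    def latencies(n):
        return {
            "proposed": 1.625 * n + n / 12 + math.log2(n) + 1.25,
            "[14]": 2 * n + 6,
            "[26]": 3 * n + 1,
            "[21]": 3.75 * n,
            "[27]": 5.25 * n,
            "[25]": 2.5 * n,
            "[17]": 2 * n + math.log(n, 3) + math.log2(n),
        }

    for name, d in sorted(latencies(54).items(), key=lambda kv: kv[1]):
        print(f"{name:9s} {d:6.1f} t_FA")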

5. CONCLUSION

This paper presented a CORDIC algorithm for the rotation mode which computes the directions of the required micro-rotations before the actual CORDIC computations start, while maintaining a constant scale factor. This is achieved by using a linear correlation between the rotation angle θ and the corresponding directions of all micro-rotations for the rotation mode. The rotation directions are obtained by adding to the rotation angle θ a constant and a variable offset which is stored in a ROM. An implementation for high precision is also provided, which reduces the size of the required ROM. Hence, neither extra nor double rotations, nor a variable scale factor, are required. The implementation is suitable for word-lengths of up to 54 bits, while maintaining a reasonable ROM size.


ACKNOWLEDGMENT

This work was supported by the Defense Advanced Research Projects Agency under contract number DA/DABT63-96-C-0050. Prof. Parhi is on leave from the Department of Electrical and Computer Engineering of the University of Minnesota, Minneapolis, MN, USA.

REFERENCES

[1] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Transactions on Electronic Computers, vol. 8, no. 3, pp. 330–334, 1959.
[2] J. S. Walther, "A unified algorithm for elementary functions," in Proc. Spring Joint Computer Conference, vol. 38, pp. 379–385, Arlington, Va, USA, 1971.
[3] L. W. Chang and S. W. Lee, "Systolic arrays for the discrete Hartley transform," IEEE Trans. Signal Processing, vol. 39, no. 11, pp. 2411–2418, 1991.
[4] W.-H. Chen, C. H. Smith, and S. C. Fralick, "A fast computational algorithm for the discrete cosine transform," IEEE Trans. Communications, vol. 25, no. 9, pp. 1004–1009, 1977.
[5] A. M. Despain, "Fourier transform computers using CORDIC iterations," IEEE Trans. on Computers, vol. 23, no. 10, pp. 993–1001, 1974.
[6] Y. H. Hu and S. Naganathan, "A novel implementation of chirp Z-transform using a CORDIC processor," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 2, pp. 352–354, 1990.
[7] M. Ercegovac and T. Lang, "Redundant and on-line CORDIC: Application to matrix triangularization and SVD," IEEE Trans. on Computers, vol. 39, no. 6, pp. 725–740, 1990.
[8] P. P. Vaidyanathan, "A unified approach to orthogonal digital filters and wave digital filters, based on LBR two-pair extraction," IEEE Trans. Circuits and Systems, vol. 32, no. 7, pp. 673–686, 1985.
[9] Y. H. Hu and H. M. Chern, "VLSI CORDIC array structure implementation of Toeplitz eigensystem solver," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, pp. 1575–1578, Albuquerque, NM, USA, April 1990.
[10] T. Y. Sung and Y. H. Hu, "Parallel VLSI implementation of Kalman filter," IEEE Trans. on Aerospace and Electronics Systems, vol. 23, pp. 215–224, March 1987.
[11] H. V. Poor and X. Wang, "Code-aided interference suppression for DS/CDMA communications—Part I: Interference suppression capability," IEEE Trans. Communications, vol. 45, no. 9, pp. 1101–1111, 1997.
[12] C. Li and S. G. Chen, "A radix-4 redundant CORDIC algorithm with fast on-line variable scale factor compensation," in International Symposium on Circuits and Systems, pp. 639–642, Hong Kong, June 1997.
[13] R. Osorio, E. Antelo, J. Villalba, J. D. Bruguera, and E. L. Zapata, "Digit on-line large radix CORDIC rotator," in Proc. Int. Conf. Application-Specific Array Processors, pp. 246–257, Strasbourg, France, July 1995.
[14] J. Villalba, J. Hidalgo, E. L. Zapata, E. Antelo, and J. D. Bruguera, "CORDIC architectures with parallel compensation of the scale factor," in Proc. Int. Conf. Application Specific Array Processors, pp. 258–269, Strasbourg, France, July 1995.
[15] M. Kuhlmann and K. K. Parhi, "A high-speed CORDIC algorithm and architecture for digital signal processing applications," in Proc. 1999 IEEE Workshop on Signal Processing Systems: Design and Implementation, pp. 732–741, Taipei, Taiwan, October 1999.
[16] M. Kuhlmann and K. K. Parhi, "A new CORDIC rotation method for generalized coordinate systems," in Proc. 1999 Asilomar Conf. on Signals, Systems and Computers, Pacific Grove, Calif, USA, October 1999.
[17] D. Timmermann, H. Hahn, and B. J. Hosticka, "Low latency time CORDIC algorithms," IEEE Trans. on Computers, vol. 41, no. 8, pp. 1010–1015, 1992.
[18] S. Wang, V. Piuri, and E. Swartzlander, "Hybrid CORDIC algorithms," IEEE Trans. on Computers, vol. 46, no. 11, pp. 1202–1207, 1997.
[19] S. Nahm and W. Sung, "A fast direction sequence generation method for CORDIC processors," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, pp. 635–638, Munich, Germany, April 1997.
[20] N. Demassieux and F. Jutand, VLSI Implementation for Image Communications, chapter 7, P. Pirsch, Ed., Elsevier Science, New York, NY, USA, 8th edition, 1993.
[21] N. Takagi, T. Asada, and S. Yajima, "Redundant CORDIC methods with a constant scale factor for sine and cosine computation," IEEE Trans. on Computers, vol. 40, no. 9, pp. 989–995, 1991.
[22] H. X. Lin and H. J. Sips, "On-line CORDIC algorithms," IEEE Trans. on Computers, vol. 39, no. 8, pp. 1038–1052, 1990.
[23] R. Hamill, J. McCanny, and R. Walke, "On-line CORDIC algorithm and VLSI architecture for implementing QR-array processors," to appear in Journal of VLSI Signal Processing, 1999.
[24] J. Duprat and J.-M. Muller, "The CORDIC algorithm: New results for fast VLSI implementation," IEEE Trans. on Computers, vol. 42, no. 2, pp. 168–178, 1993.
[25] D. S. Phatak, "Double step branching CORDIC: A new algorithm for fast sine and cosine generation," IEEE Trans. on Computers, vol. 47, no. 5, pp. 587–602, 1998.
[26] H. Dawid and H. Meyr, "The differential CORDIC algorithm: Constant scale factor redundant implementation without correcting iterations," IEEE Trans. on Computers, vol. 45, no. 3, pp. 307–318, 1996.
[27] J.-A. Lee and T. Lang, "A constant-factor redundant CORDIC for angle calculation and rotation," IEEE Trans. on Computers, vol. 41, no. 8, pp. 1016–1025, 1992.

Martin Kuhlmann received his Diplome Ingenieur degree in electrical engineering from the University of Technology Aachen, Germany, in 1997 and his Ph.D. degree from the University of Minnesota in 1999. Currently, he is a staff design engineer at Broadcom Corporation, Irvine, CA, USA. His research interests include computer arithmetic, digital communication, VLSI design, and deep-submicron crosstalk.

Keshab K. Parhi is a Distinguished McKnight University Professor of Electrical and Computer Engineering at the University of Minnesota, Minneapolis, where he also holds the Edgar F. Johnson Professorship. He received the B.Tech., M.S.E.E., and Ph.D. degrees from the Indian Institute of Technology, Kharagpur, India (1982), the University of Pennsylvania, Philadelphia (1984), and the University of California at Berkeley (1988), respectively. His research interests include all aspects of physical-layer VLSI implementations of broadband access systems. He is currently working on VLSI adaptive digital filters, equalizers and beamformers, error control coders and cryptography architectures, low-power digital systems, and computer arithmetic.


EURASIP Journal on Applied Signal Processing 2002:9, 944–953
© 2002 Hindawi Publishing Corporation

Frequency Spectrum Based Low-Area Low-Power Parallel FIR Filter Design

Jin-Gyun Chung
Division of Electronic and Information Engineering, Chonbuk National University, Chonju 561-756, Korea
Email: [email protected]

Keshab K. Parhi
Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA
Email: [email protected]

Received 11 July 2001 and in revised form 15 May 2002

Parallel (or block) FIR digital filters can be used either for high-speed or for low-power (with reduced supply voltage) applications. Traditional parallel filter implementations cause a linear increase in the hardware cost with respect to the block size. Recently, an efficient parallel FIR filter implementation technique requiring a less-than-linear increase in the hardware cost was proposed. This paper makes two contributions. First, the filter spectrum characteristics are exploited to select the best fast filter structures. Second, a novel block filter quantization algorithm is introduced. Using filter benchmarks, it is shown that the use of the appropriate fast FIR filter structures and the proposed quantization scheme can result in a reduction of up to 20% in the number of binary adders.

Keywords and phrases: parallel FIR filter, quantization, fast FIR algorithm, canonic signed digit.

1. INTRODUCTION

Finite impulse response (FIR) filters are widely used in various DSP applications. In some applications, the FIR filter circuit must be able to operate at high sample rates, while in other applications, the FIR filter circuit must be a low-power circuit operating at moderate sample rates. The low-power or low-area techniques developed specifically for digital filters can be found in [1, 2, 3, 4, 5, 6, 7].

Parallel (or block) processing can be applied to digital FIR filters to either increase the effective throughput or reduce the power consumption of the original filter. While sequential FIR filter implementation has been given extensive consideration, very little work has been done that deals directly with reducing the hardware complexity or power consumption of parallel FIR filters.

Traditionally, the application of parallel processing to an FIR filter involves the replication of the hardware units that exist in the original filter. If the area required by the original circuit is A, then the L-parallel circuit requires an area of L × A. Recently, an efficient parallel FIR filter implementation technique requiring a less-than-linear increase in the hardware cost was proposed using FFAs (fast FIR algorithms) [8].

In [9], it was shown that the power consumption of arithmetic units can be reduced if statistical properties of the input signals are exploited. In this paper, based on [10], it is shown that the hardware cost can be reduced by exploiting the frequency spectrum characteristics of the given transfer function. This is achieved by selecting appropriate FFA structures out of many possible FFA structures, all of which have similar hardware complexity at the word level. However, their complexity can differ significantly at the bit level. For example, in narrowband low-pass filters, the signs of consecutive unit sample response values do not change much, and therefore their difference can require a smaller number of bits than their sum. This favors the use of a parallel structure whose subfilters require the difference of consecutive unit sample response values as opposed to their sum.

In addition to the appropriate selection of FFA structures, proper quantization of the subfilters is important for low-power or low hardware cost implementation of parallel FIR filters. It is shown in [5, 6, 7] that if the filter coefficients are first scaled before the quantization process is performed, the resulting filter will have much better frequency-space characteristics. When the quantized filter is implemented, a postprocessing scale factor (PPSF) is used to properly adjust the magnitude of the filter output. In cases where large levels of parallelism are used, the number of required subfilters is large, and consequently the PPSFs can contribute a significant amount of hardware overhead. In [8], PPSFs are restricted to a set of simple values to reduce the hardware overhead due to PPSFs. Since the original PPSF is replaced with the new simple PPSF


that is the nearest in value, the quantized filter coefficients must also be properly modified. However, this approach is not guaranteed to give optimal quantized coefficients, since already quantized coefficients are modified again. To avoid this problem, we propose the look-ahead maximum absolute difference (LMAD) quantization algorithm, which gives optimal quantized coefficients for a given simple PPSF value.

In Section 2, FFAs are briefly reviewed. Also, frequency-spectrum-related hardware complexities for different types of FFAs are discussed. Section 3 presents a quantization method suitable for block FIR filters. Section 4 presents several block filter design examples.

2. FAST FIR ALGORITHMS

Consider the general formulation of a length-N FIR filter,

y_n = Σ_{i=0}^{N−1} h_i x_{n−i},  n = 0, 1, 2, . . . , ∞,   (1)

where {x_i} is an infinite-length input sequence and {h_i} are the length-N FIR filter coefficients. Then the polyphase representation of a traditional L-parallel FIR filter [11] can be expressed as

Σ_{i=0}^{L−1} Y_i(z^L) z^{−i} = [Σ_{j=0}^{L−1} H_j(z^L) z^{−j}] · [Σ_{k=0}^{L−1} X_k(z^L) z^{−k}],   (2)

where Y_i(z) = Σ_{m=0}^{∞} z^{−m} y_{mL+i}, H_i(z) = Σ_{m=0}^{N/L−1} z^{−m} h_{mL+i}, and X_i(z) = Σ_{m=0}^{∞} z^{−m} x_{mL+i}, for i = 0, 1, . . . , L − 1. This block FIR filtering equation shows that the parallel FIR filter can be realized using L^2 FIR filters of length N/L. This linear complexity can be reduced using various FFA structures.
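
The L = 2 case of this decomposition is easy to verify numerically (a NumPy sketch of ours; the block equations used here are exactly those derived as (4) in Section 2.1, and the random h and x are illustrative only):

    import numpy as np

    # Verifies the L = 2 polyphase decomposition of (2):
    # Y0 = H0 X0 + z^-2 H1 X1 and Y1 = H0 X1 + H1 X0 in the block domain.
    rng = np.random.default_rng(0)
    h, x = rng.normal(size=8), rng.normal(size=32)
    h0, h1 = h[0::2], h[1::2]              # polyphase components of h
    x0, x1 = x[0::2], x[1::2]              # polyphase components of x

    y0 = np.convolve(h0, x0)
    y0[1:] += np.convolve(h1, x1)[:-1]     # z^-2 is one delay in the block domain
    y1 = np.convolve(h0, x1) + np.convolve(h1, x0)

    y = np.convolve(h, x)                  # serial reference filter
    assert np.allclose(y0, y[0::2][:len(y0)])
    assert np.allclose(y1, y[1::2][:len(y1)])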

2.1. 2 × 2 (L = 2) FFAs

From (2) with L = 2, we have

Y_0 + z^{−1} Y_1 = (H_0 + z^{−1} H_1)(X_0 + z^{−1} X_1)
                = H_0 X_0 + z^{−1}(H_0 X_1 + H_1 X_0) + z^{−2} H_1 X_1,   (3)

which implies that

Y_0 = H_0 X_0 + z^{−2} H_1 X_1,
Y_1 = H_0 X_1 + H_1 X_0.   (4)

Direct implementation of (4) is shown in Figure 1. This structure computes a block of 2 outputs using 4 length-N/2 FIR filters and 2 postprocessing additions, which requires 2N multipliers and 2N − 2 adders.

If (4) is written in a different form, the (2 × 2) FFA0 (FFA-type 0) is obtained,

Y_0 = H_0 X_0 + z^{−2} H_1 X_1,
Y_1 = H_{0+1} X_{0+1} − H_0 X_0 − H_1 X_1,   (5)

where H_{i+j} = H_i + H_j and X_{i+j} = X_i + X_j. Implementation of (5) is shown in Figure 2. This structure computes a block of


Figure 1: Traditional 2-parallel FIR filter.


Figure 2: 2-parallel FIR filter using FFA0.


Figure 3: 2-parallel FIR filter using FFA1.

2 outputs using 3 length-N/2 FIR filters and 4 preprocessing and postprocessing additions, which requires 3N/2 multipliers and 3(N/2 − 1) + 4 adders.

By a simple modification of (5), the following FFA1 (FFA-type 1) is derived [11],

Y_0 = H_0 X_0 + z^{−2} H_1 X_1,
Y_1 = −H_{0−1} X_{0−1} + H_0 X_0 + H_1 X_1.   (6)

In (6), H_{0−1} = H_0 − H_1 and X_{0−1} = X_0 − X_1. The structure derived by FFA1 is shown in Figure 3. The structures derived by FFA0 and FFA1 are essentially the same except for some sign changes. Notice that, in FFA1, H_{0−1} is used instead of H_{0+1}.

When an FIR filter is implemented using a multiplierless approach, the hardware complexity is directly proportional to the number of nonzero bits in the filter coefficients. If the signs of the given impulse response sequence do not change frequently, as in the narrowband low-pass filter cases, the coefficient magnitudes of H_0 + H_1 are likely to be larger than those of H_0 − H_1. Then, H_0 + H_1 has more nonzero bits in its coefficients than H_0 − H_1. (See the examples in Section 4.) If the signs of the given impulse response sequence change frequently, as in the wide-band low-pass filter cases, H_0 − H_1 is likely to have more nonzero bits in its coefficients than


H_0 + H_1. Thus, to achieve minimum hardware cost, it is necessary to select either FFA0 or FFA1 depending upon the frequency spectrum specifications; a small numeric sketch of the FFA0 case follows.
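
Here is that sketch: a minimal NumPy check of the 2-parallel FFA0 of (5), where the three subfilters H_0, H_1, and H_0 + H_1 replace the four subfilters of the traditional structure (random coefficients are illustrative only):

    import numpy as np

    # 2-parallel FFA0 of (5), checked against the serial filter.
    rng = np.random.default_rng(1)
    h, x = rng.normal(size=8), rng.normal(size=32)
    h0, h1 = h[0::2], h[1::2]
    x0, x1 = x[0::2], x[1::2]

    p00 = np.convolve(h0, x0)
    p11 = np.convolve(h1, x1)
    psum = np.convolve(h0 + h1, x0 + x1)   # the shared subfilter H0+1 X0+1

    y0 = p00.copy()
    y0[1:] += p11[:-1]                     # Y0 = H0 X0 + z^-2 H1 X1
    y1 = psum - p00 - p11                  # Y1 = H0+1 X0+1 - H0 X0 - H1 X1

    y = np.convolve(h, x)
    assert np.allclose(y0, y[0::2][:len(y0)])
    assert np.allclose(y1, y[1::2][:len(y1)])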

2.2. 3 × 3 (L = 3) FFAs

The (3 × 3) FFA produces a parallel filtering structure of block size 3. From (2) with L = 3, we have

Y_0 = H_0 X_0 + z^{−3}(H_1 X_2 + H_2 X_1),
Y_1 = (H_0 X_1 + H_1 X_0) + z^{−3} H_2 X_2,
Y_2 = H_0 X_2 + H_1 X_1 + H_2 X_0.   (7)

Direct implementation of (7) computes a block of 3 outputs using 9 length-N/3 FIR filters and 6 postprocessing additions, which requires 3N multipliers and 3N − 3 adders.

By a similar approach as in the (2 × 2) FFA0, the following (3 × 3) FFA0 is obtained,

Y_0 = H_0 X_0 − z^{−3} H_2 X_2 + z^{−3}[H_{1+2} X_{1+2} − H_1 X_1],
Y_1 = [H_{0+1} X_{0+1} − H_1 X_1] − [H_0 X_0 − z^{−3} H_2 X_2],
Y_2 = H_{0+1+2} X_{0+1+2} − [H_{0+1} X_{0+1} − H_1 X_1] − [H_{1+2} X_{1+2} − H_1 X_1].   (8)

Figure 4 shows the filtering structure that results from the (3 × 3) FFA0. This structure computes a block of 3 outputs using 6 length-N/3 FIR filters and 10 preprocessing and postprocessing additions, which requires 6(N/3) multipliers and 6(N/3 − 1) + 10 adders. Notice that the (3 × 3) FFA0 structure provides a saving of approximately 33% over the traditional structure.

The (3 × 3) FFA1 structure can be obtained by modifying (8) as follows:

Y_0 = H_0 X_0 + z^{−3} H_2 X_2 − z^{−3}[H_{2−1} X_{2−1} − H_1 X_1],
Y_1 = −[H_{0−1} X_{0−1} − H_1 X_1] + [H_0 X_0 + z^{−3} H_2 X_2],
Y_2 = H_{0−1+2} X_{0−1+2} − [H_{0−1} X_{0−1} − H_1 X_1] − [H_{2−1} X_{2−1} − H_1 X_1].   (9)

Figure 5 shows the filtering structure that results from the (3 × 3) FFA1.

We propose the following (3 × 3) FFA2 structure, which is efficient when the coefficient magnitudes of H_{0−2} are smaller than those of H_{0−1+2} or H_{0+1+2},

Y_0 = H_0 X_0 + z^{−3}(H_2 X_2 + H_1 X_1 − H_{2−1} X_{2−1}),
Y_1 = −H_{0−1} X_{0−1} + H_1 X_1 + H_0 X_0 + z^{−3} H_2 X_2,
Y_2 = −H_{0−2} X_{0−2} + H_0 X_0 + H_1 X_1 + H_2 X_2.   (10)

Figure 6 shows the filtering structure that results from the (3 × 3) FFA2.

2.3. Cascading FFAs

The (2 × 2) and (3 × 3) FFAs can be cascaded together to achieve higher levels of parallelism. The cascading of FFAs is a straightforward extension of the original FFA application [8]. For example, an (m × m) FFA can be cascaded with an (n × n) FFA to produce an (m × n)-parallel filtering structure. The set of FIR filters that result from the application of the (m × m) FFA are further decomposed, one at a time, by the application of the (n × n) FFA. The resulting set of filters will be of length N/(m × n).

For example, the (4 × 4) FFA can be obtained by first applying the (2 × 2) FFA0 to (2) and then applying the (2 × 2) FFA0 or the (2 × 2) FFA1 to each of the filtering operations that result from the first application of the FFA0. The resulting (4 × 4) FFA structure is shown in Figure 7. Each filter block F_0, F_0 + F_1, and F_1 represents a (2 × 2) FFA structure and can be replaced separately by either the (2 × 2) FFA0 or the (2 × 2) FFA1. Each filter block F_0, F_0 + F_1, and F_1 is composed of three subfilters as follows:

(i) F_0: H_0, H_2, H_0 ± H_2,
(ii) F_0 + F_1: H_0 + H_1, H_2 + H_3, (H_0 + H_1) ± (H_2 + H_3),
(iii) F_1: H_1, H_3, H_1 ± H_3,

where

± = + for FFA0, − for FFA1.   (11)

When the filter block F_0 + F_1 is implemented using the FFA1 structure, the subfilters are H_{0+1}, H_{2+3}, and H_{0+1} − H_{2+3}. Thus, even though the FFA1 structure is used for slowly varying impulse response sequences, optimum performance is not guaranteed. In this case, better performance can be obtained by using the FFA1′ shown in Figure 8. Since the subfilters in FFA1′ are H_{0−1}, H_{2−3}, and H_{0−1} − H_{2−3}, the FFA1′ gives a smaller number of nonzero bits than FFA1 in the case of slowly varying impulse response sequences. Notice that the FFA1′ structure can be derived by first applying the (2 × 2) FFA1 (instead of the (2 × 2) FFA0) to (2). When the filter block F_0 + F_1 in Figure 7 is replaced by the FFA1′ in Figure 8, it can be shown that the outputs are y(4k), −y(4k + 1), y(4k + 2), and −y(4k + 3).

2.4. Selection of FFA types

For given length-N unit sample response values {h_i} and block size L, the selection of the best FFA type can be roughly determined by comparing the signs of the values in the subfilters H_0, H_1, . . . , H_{L−1}.

For example, in the case of L = 2 and even N, H_0 and H_1 are

H_0 = {h_0, h_2, . . . , h_{N−2}},
H_1 = {h_1, h_3, . . . , h_{N−1}}.   (12)

From (12), the ith value of H_0 can be paired with the ith value of H_1 as (h_0, h_1), (h_2, h_3), . . . , (h_{N−2}, h_{N−1}). Comparing the signs of the values in each pair, the number of pairs with opposite signs and the number of pairs with the same signs can be determined. If the number of pairs with opposite signs is larger than the number of pairs with the same signs, H_0 + H_1 is likely to be more efficient than H_0 − H_1. The sign-comparing procedure can be extended to any block size L with appropriate modifications; a sketch of this rule is given below.
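
As noted above, the rule takes only a few lines (our sketch for L = 2; np.hamming is just an example of a slowly varying, all-positive response):

    import numpy as np

    # Section 2.4 selection rule for L = 2: count the sign agreement of the
    # pairs (h_2i, h_{2i+1}). Many opposite-sign pairs favor FFA0 (H0 + H1);
    # many same-sign pairs favor FFA1 (H0 - H1).
    def select_ffa_type(h):
        h0, h1 = h[0::2], h[1::2]
        opposite = int(np.sum(np.sign(h0) != np.sign(h1)))
        same = len(h0) - opposite
        return "FFA0 (H0 + H1)" if opposite > same else "FFA1 (H0 - H1)"

    print(select_ffa_type(np.hamming(16)))   # all positive -> FFA1 (H0 - H1)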

Page 77: Implementation of DSP and Communication Systemsdownloads.hindawi.com/journals/specialissues/434718.pdf · ing society (1996–1998), a board member at IEEE neural network council,


Figure 4: 3-parallel FIR filter using FFA0.

Figure 5: 3-parallel FIR filter using FFA1.

3. LOOK-AHEAD MAD QUANTIZATION

It is shown in [5, 6, 7] that if the filter coefficients are first scaled before the quantization process is performed, the resulting filter will have much better frequency-space characteristics. The NUS algorithm [6] employs a scalable quantization process. To begin the process, the ideal filter is normalized so that the largest coefficient has an absolute value of 1. The normalized ideal filter is then multiplied by a variable scale factor (VSF). The VSF steps through the range of numbers from 0.4375 to 1.13 with a step size of 2^−W, where W is the coefficient word length. Signed power-of-two (SPT) terms are then allocated to the quantized filter coefficient that represents the largest absolute difference between the scaled ideal filter and the quantized filter. The NUS algorithm iteratively allocates SPT terms until the desired number of SPT terms is allocated or until the desired normalized peak ripple (NPR) specification is met. Once the allocation of terms stops, the NPR is calculated. The process is then repeated for a new scale factor. The quantized filter leading to the minimum NPR is chosen.

In parallel FIR filters, the NPR cannot be used as a selection criterion for choosing the best quantized filter, since passband/stopband ripples cannot be defined for the set of subfilters obtained by the application of FFAs. In [8], it is shown that the maximum absolute difference (MAD) between the frequency responses of the ideal filter and the quantized filter can be used as an efficient selection criterion for parallel filters.



Figure 6: 3-parallel FIR filter using FFA2.

Figure 7: 4-parallel FIR filter structure.

Figure 8: FFA1′ structure.


When the quantized filter is implemented, a postprocessing scale factor (PPSF) is used to properly adjust the magnitude of the filter output. The PPSF is calculated as

$$
\text{PPSF} = \frac{\text{Max}\big[\text{Absolute}(\text{Ideal Filter Coeffs.})\big]}{\text{VSF}}.
\tag{13}
$$

In cases where large levels of parallelism are used, the PPSFs can contribute a significant amount of hardware overhead. In [8], to reduce this hardware overhead, the PPSFs are restricted to the following set of values: {0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1}. The original PPSF is replaced with the new PPSF that is nearest in value. Since the scale factor of the quantized filter is shifted in value, the quantized coefficients must also be properly shifted in value.



For each filter section in the parallel FIR filter
{
    Normalize the set of filter coefficients so that the magnitude of the largest coefficient is 1;
    For VSF = Lower Scale : Step Size : Upper Scale
    {
        Compute PPSF by (13);
        Convert PPSF into canonic signed digit form;
        If (No. of nonzero bits in PPSF) < prespecified value
        {
            Scale normalized coefficients with VSF;
            Quantize the scaled coefficients using the SPT term allocation scheme in the NUS algorithm;
            Calculate MAD between the frequency responses of the ideal and quantized filters;
        }
    }
    Choose the scale factor that leads to the minimum MAD;
}

Algorithm 1: Look-ahead MAD quantization.

This is accomplished using the following three steps (sketched in code below):

(i) determine the effective coefficients: effective coeffs. = quantized coeffs. × PPSF;
(ii) determine the shifted coefficients with the new PPSF: shifted coeffs. = effective coeffs. / new PPSF;
(iii) quantize the shifted coefficients.
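A minimal sketch of these three steps (our own illustration; `quantize` stands in for the SPT allocation scheme of [8]):

    def shift_to_new_ppsf(quant_coeffs, ppsf, new_ppsf, quantize):
        # Step (i): effective coefficients.
        effective = [c * ppsf for c in quant_coeffs]
        # Step (ii): shifted coefficients under the new PPSF.
        shifted = [c / new_ppsf for c in effective]
        # Step (iii): re-quantize; note this acts on already quantized
        # values, which is why optimality is not guaranteed.
        return [quantize(c) for c in shifted]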

However, the above steps are not guaranteed to give optimal quantized coefficients for the new PPSF value. The reason is that the quantization in (iii) is performed on the already quantized coefficients.

To avoid this problem, the LMAD quantization algorithm is proposed. In the proposed algorithm, the PPSF for a given VSF is computed by (13) before the quantization step begins. If the number of nonzero bits in the computed PPSF is less than a prespecified value, then the normalized coefficients are scaled by the VSF and the scaled coefficients are quantized. Otherwise, the procedure is repeated for the next VSF value.

In [8], the number of nonzero bits in the PPSF is fixed. In the proposed approach, however, the number of nonzero bits in the PPSF can be varied and the PPSF value giving the best performance can be selected. From our simulation experience, increasing the number of nonzero bits in the PPSF beyond three does not improve the numerical performance significantly.
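The following sketch outlines the LMAD scan (our own illustration, not the authors' code; numpy is assumed, `spt_quantize` and `mad` are assumed helpers for the NUS-style SPT allocation and the frequency-response MAD, and the greedy CSD counter is only a stand-in):

    import numpy as np

    def csd_nonzero_bits(x, wordlength):
        # Greedy signed-power-of-two expansion; a stand-in for a true
        # canonic-signed-digit converter.
        bits, r = 0, x
        lsb = 2.0 ** -(wordlength - 1)
        while abs(r) >= lsb / 2 and bits < wordlength:
            r -= np.sign(r) * 2.0 ** round(np.log2(abs(r)))
            bits += 1
        return bits

    def lmad_quantize(ideal, wordlength, spt_quantize, mad, max_ppsf_bits=3):
        norm = ideal / np.max(np.abs(ideal))
        best, best_mad = None, np.inf
        for vsf in np.arange(0.4375, 1.13, 2.0 ** -wordlength):
            ppsf = np.max(np.abs(ideal)) / vsf           # equation (13)
            if csd_nonzero_bits(ppsf, wordlength) > max_ppsf_bits:
                continue                                 # look ahead: skip early
            q = spt_quantize(norm * vsf, wordlength)
            m = mad(ideal, q * ppsf)
            if m < best_mad:
                best, best_mad = (q, ppsf), m
        return best, best_mad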

Example 1. Consider an ideal filter section with the following coefficients [8]: ideal coeffs. = {−.0648, .1404, .4328, −.0818, .0391}. In [8], these coefficients are quantized using a word length of 7 bits to the following values by the scalable MAD quantization algorithm: {−.109375, .203125, .6875, −.140625, .046875} with PPSF = 0.625. The computed MAD value is 0.0360125. For comparison, the ideal coefficients are quantized using the proposed algorithm with PPSF = 0.625.

Figure 9: Frequency responses of Example 1.

Table 1: The number of adders by the method used in [8] and by the proposed method. The numbers inside parentheses denote the FFA types used for each case.

         24-Tap FIR                 72-Tap FIR
         By [8]       Proposed      By [8]       Proposed
L = 1    56 (0)       49 (0)        125 (0)      96 (0)
L = 2    74 (0)       54 (1)        192 (0)      173 (1)
L = 3    119 (0)      99 (1)        293 (0)      272 (1)
L = 4    133 (0-0-0)  123 (0-1-0)   313 (0-0-0)  303 (1-1-1)

The quantized coefficients are {−.109375, .21875, .6875, −.140625, .0625}, and the computed MAD value is 0.01648125. Notice that the MAD value by the proposed method is only 45% of the MAD value in [8]. The frequency responses are compared in Figure 9.

Table 1 shows that, for the two low-pass FIR filter examples in [8], the proposed method can save up to 24% of the adders. In [8], only FFA type 0 is used for each value of L. However, as can be seen from Table 1, better results are obtained by selecting the FFA type(s) properly for each L.

Example 2. In this example, the hardware saving by the appropriate selection of FFA structures is compared with the hardware saving by the proposed LMAD quantization scheme using a simple low-pass filter with filter order = 7, passband edge = 0.1π, maximum passband ripple = 0.02 dB, stopband edge = 0.3π, and minimum stopband attenuation = 22 dB. In this example, only a block size of 2 (L = 2) is considered.

Table 2 shows the filter coefficients obtained by FFA0 without scaling, and Table 3 shows the filter coefficients obtained by FFA0 with LMAD scaling.



Table 2: Filter coefficients (canonic signed digit format) and the number of nonzero bits for FFA0 without scaling (word-length = 8).

                  H0                 H0+1               H1
Coefficients      0 0 0 0 1 0 0 1    0 0 1 0 1 0 0 1    0 0 0 1 0 0 0 1
                  0 0 0 1 0 1 0 0    0 1 0 1 0 1 0 1    0 0 1 0 1 0 0 0
                  0 0 1 0 1 0 0 0    0 1 0 1 0 1 0 1    0 0 0 1 0 1 0 0
                  0 0 0 1 0 0 0 1    0 0 1 0 1 0 0 1    0 0 0 0 1 0 0 1
Nonzero bits      8                  14                 8

Table 3: Filter coefficients and the number of nonzero bits for FFA0 with LMAD scaling (word-length = 7).

                  H0               H0+1             H1
Coefficients      0 0 1 0 0 1 0    0 0 1 0 1 0 0    0 1 0 0 0 1 0
                  0 1 0 1 0 0 0    0 1 0 0 1 0 0    1 0 1 0 0 0 0
                  1 0 1 0 0 0 0    0 1 0 0 1 0 0    0 1 0 1 0 0 0
                  0 1 0 0 0 1 0    0 0 1 0 1 0 0    0 0 1 0 0 1 0
Nonzero bits      8                8                8

Table 4: Filter coefficients and the number of nonzero bits for FFA1 with LMAD scaling (word-length = 7).

                  H0               H0−1             H1
Coefficients      0 0 1 0 0 0 1    1 0 1 0 0 0 0    0 1 0 1 0 0 1
                  0 1 0 0 0 0 0    0 1 0 0 0 0 0    0 1 0 1 0 0 0
                  0 1 0 1 0 0 0    0 1 0 0 0 0 0    0 1 0 0 0 0 0
                  0 1 0 1 0 0 1    1 0 1 0 0 0 0    0 0 1 0 0 0 1
Nonzero bits      8                6                8

Notice that the filter coefficients by FFA0 with LMAD scaling satisfy the given specifications with a word length of 7 bits, while the filter coefficients by FFA0 without scaling require a word length of 8 bits. The reduction of the word length is due to the use of scaling factors. The PPSFs for the filter coefficients by FFA0 with LMAD scaling are 0010001 (H0), 0101000 (H0+1), and 0010001 (H1). Each PPSF contains two nonzero bits, which corresponds to an overhead of one adder. Table 4 shows the filter coefficients obtained by FFA1 with LMAD scaling. The PPSFs for the filter coefficients by FFA1 with LMAD scaling are 00101 (H0), 00001 (H0−1), and 00101 (H1). The frequency responses of the ideal filter and the filter obtained by FFA1 quantized by LMAD are compared in Figure 10.

To compare the hardware savings by the quantization and by the proper selection of FFA types, only the H0+1 or H0−1 subfilters are considered. From Table 2, the number of nonzero bits for H0+1 of the nonscaled FFA0 filter is 14, while the number of nonzero bits for H0+1 of the scaled FFA0 filter is 10 (including the PPSF). Thus, in addition to the word-length reduction, a hardware saving of about 28% can be obtained by LMAD scaling.

From Table 4, the number of nonzero bits for H0−1 of the scaled FFA1 filter is 7 (including the PPSF). Thus, a 22% further saving is obtained by the selection of the proper filter type. In this example, therefore, about half of the saving is due to the LMAD quantization and the other half is due to proper filter type selection.

4. DESIGN EXAMPLES

In this section, three design examples with various frequency specifications are given.

Example 3. Consider a narrowband low-pass filter with filter order = 35, passband edge = 0.2π, maximum passband ripple = 0.185 dB, stopband edge = 0.3π, and minimum stopband attenuation = 33.5 dB. As can be seen from Figure 11, the signs of the impulse response sequences (designed by the Remez exchange algorithm) change slowly.

For L = 2, according to the discussion in Section 2.4, the number of pairs with the same signs is 16, while the number of pairs with opposite signs is only 2. Thus, FFA1 is more efficient than FFA0. By the LMAD quantization algorithm, the number of nonzero bits required for H0+1 is 42, but the number of nonzero bits required for H0−1 is 24. Thus the hardware cost of H0−1 is about 57% of the hardware cost of H0+1. The frequency responses for L = 2 are compared in Figure 12.

For L = 3, the number of pairs with the same signs in the subfilter pairs {H0, H1}, {H1, H2}, and {H0+2, H1} is 28, while the number of pairs with opposite signs is 8. Also, the number of pairs with the same signs in the subfilter pairs {H0, H1}, {H1, H2}, and {H0, H2} is 12. Thus, FFA1 is the most efficient.

For L = 4, the number of pairs with opposite signs in the subfilter pair {H0, H2} is 7, while the number of pairs with the same signs is 2.



Figure 10: Frequency responses of the ideal filter and the filter obtained by FFA1 quantized by LMAD.

Figure 11: Ideal impulse response of Example 3.

Thus, FFA0 is the most efficient for F0. The number of pairs with opposite signs in the subfilter pair {H1, H3} is 7, while the number of pairs with the same signs is 2. Thus, FFA0 is the most efficient for F1. By a similar procedure, it can be shown that FFA1′ is the most efficient choice for F0 + F1.

The design results for L = 2, 3, and 4 are summarized in Table 5. For L = 2 and L = 3, about 20% of the hardware can be saved by a proper choice of FFA types. However, for L = 4, only a 7% hardware saving can be achieved by a proper choice of FFA types. The main reason is that the correlation of filter coefficients between subfilters is reduced as the block size increases.

Example 4. Consider a wideband low-pass filter with filter order = 62, passband edge = 0.8π, maximum passband ripple = 0.27 dB, stopband edge = 0.85π, and minimum stopband attenuation = 32.5 dB.

Figure 12: Frequency responses of Example 3.

Table 5: Total number of nonzero bits for Example 3 with different block sizes and various structures (word-length = 10).

L = 2            L = 3                   L = 4
FFA0    FFA1     FFA0    FFA1    FFA2    FFA0-0-0    FFA1-1-1    FFA0-1′-0
102     84       144     115     123     194         196         184

Table 6: Total number of nonzero bits for Example 4 with different block sizes and various structures (word-length = 9 for L = 2 and word-length = 10 for L = 3 and L = 4).

L = 2            L = 3                   L = 4
FFA0    FFA1     FFA0    FFA1    FFA2    FFA1-1-1    FFA1-0-1    FFA0-1′-0
170     193      244     286     284     284         285         295

As can be seen from Figure 13, the signs of the impulse response sequences change frequently. By the sign-comparing procedure, the best FFA types are predicted as FFA0 (L = 2), FFA0 (L = 3), and FFA1-FFA1-FFA1 (L = 4).

The design results for L = 2, 3, and 4 are summarized in Table 6. For L = 2 and L = 3, about 12%–15% of the hardware can be saved by a proper choice of FFA types. For L = 4, a 4% hardware saving can be achieved by a proper choice of FFA types.

Example 5. Consider a narrow bandpass filter with filter order = 86, passband = 0.22π ∼ 0.3π, maximum passband ripple = 0.19 dB, stopbands = 0 ∼ 0.18π and 0.34π ∼ π, and minimum stopband attenuation = 35 dB. Figure 14 shows the impulse response sequence.



Figure 13: Ideal impulse response of Example 4.

Figure 14: Ideal impulse response of Example 5.

By the sign-comparing procedure, the best FFA types are predicted as FFA1 (L = 2), FFA2 (L = 3), and FFA0-FFA1′-FFA0 (L = 4). The design results for L = 2, 3, and 4 are summarized in Table 7. For L = 2 and L = 3, about 16%–18% of the hardware can be saved by a proper choice of FFA types. For L = 4, a 4% hardware saving can be achieved by a proper choice of FFA types.

5. CONCLUSIONS

It has been shown that the hardware cost and power consumption of parallel FIR filters can be reduced significantly by exploiting the frequency spectrum characteristics. For example, in narrowband low-pass filters, the signs of consecutive unit sample response values do not change much, and therefore their difference (FFA1) can require fewer nonzero bits than their sum (FFA0). In wideband low-pass filters, the signs of consecutive unit sample response values change frequently, and therefore their sum (FFA0) can require fewer nonzero bits than their difference (FFA1). To determine the best FFA type for a given impulse response sequence and block size L, a sign-comparing procedure was proposed.

Table 7: Total number of nonzero bits for Example 5 with different block sizes and various structures (word-length = 12).

L = 2            L = 3                   L = 4
FFA0    FFA1     FFA0    FFA1    FFA2    FFA0-0-0    FFA1-1-1    FFA0-1′-0
299     252      413     345     339     461         474         458

The usefulness of the proposed sign-comparing procedure was demonstrated by several examples. Also, the proposed look-ahead MAD quantization algorithm was shown to be very efficient for the implementation of parallel FIR filters.

Substructure sharing is the process of examining the hardware implementation of the filter coefficients and sharing the hardware units that are common among the filter coefficients. Using the substructure sharing techniques in [8], further savings in hardware cost and power consumption can be achieved.

Developing a similar approach to the power reduction of adaptive FIR filters will be an interesting topic for future research. Further research also needs to be directed towards the finite word-length analysis of these low-power parallel FIR filters.

ACKNOWLEDGMENTS

This research was supported in part by the Information and Communication Research Institute at Chonbuk National University and by the NSF under Grant number CCR-9988262.

REFERENCES

[1] J. W. Adams and A. N. Willson Jr., "Some efficient digital prefilter structures," IEEE Trans. Circuits and Systems, vol. 31, no. 3, pp. 260–265, 1984.

[2] J. T. Ludwig, S. H. Nawab, and A. P. Chandrakasan, "Low-power digital filtering using approximate processing," IEEE Journal of Solid-State Circuits, vol. 31, no. 3, pp. 395–400, 1996.

[3] N. Sankarayya, K. Roy, and D. Bhattacharya, "Algorithms for low-power high speed FIR filter realization using differential coefficients," IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 44, no. 6, pp. 488–497, 1997.

[4] N. R. Shanbhag and M. Goel, "Low-power adaptive filter architectures and their application to 51.84 Mb/s ATM-LAN," IEEE Trans. Signal Processing, vol. 45, no. 5, pp. 1276–1290, 1997.

[5] H. Samueli, "An improved search algorithm for the design of multiplierless FIR filters with powers-of-two coefficients," IEEE Trans. Circuits and Systems, vol. 36, no. 7, pp. 1044–1047, 1989.

[6] D. Li, J. Song, and Y. C. Lim, "A polynomial-time algorithm for designing digital filters with power-of-two coefficients," in Proc. IEEE International Symposium on Circuits and Systems, vol. 1, pp. 84–87, Chicago, Ill, USA, May 1993.

[7] C.-L. Chen, K.-Y. Khoo, and A. N. Willson Jr., "An improved polynomial-time algorithm for designing digital filters with power-of-two coefficients," in Proc. IEEE International Symposium on Circuits and Systems, vol. 1, pp. 223–226, Seattle, Wash, USA, 30 April–3 May 1995.



[8] D. A. Parker and K. K. Parhi, "Low-area/power parallel FIR digital filter implementations," Journal of VLSI Signal Processing, vol. 17, no. 1, pp. 75–92, 1997.

[9] M. Winzker, "Low-power arithmetic for the processing of video signals," IEEE Trans. on VLSI Systems, vol. 6, no. 3, pp. 493–497, 1998.

[10] J.-G. Chung, Y.-B. Kim, H.-J. Jeong, K. K. Parhi, and Z. Wang, "Efficient parallel FIR filter implementations using frequency spectrum characteristics," in Proc. IEEE International Symposium on Circuits and Systems, vol. 5, pp. 483–486, Monterey, Calif, USA, 31 May–3 June 1998.

[11] Z. J. Mou and P. Duhamel, "Short-length FIR filters and their use in fast nonrecursive filtering," IEEE Trans. Signal Processing, vol. 39, no. 6, pp. 1322–1332, 1991.

Jin-Gyun Chung received his B.S. degree in electronic engineering from Chonbuk National University, Chonju, South Korea, in 1985, and the M.S. and Ph.D. degrees in electrical engineering from the University of Minnesota, Minneapolis, Minnesota, in 1991 and 1994, respectively. Since 1995, he has been with the Department of Electronic and Information Engineering at Chonbuk National University, where he is currently Associate Professor. His research interests are in the area of VLSI architectures and algorithms for signal processing and communication systems, which include the design of high-speed and low-power algorithms for digital filters, DSL systems, OFDM systems, and ultrasonic NDE systems.

Keshab K. Parhi is a Distinguished McKnight University Professor of Electrical and Computer Engineering at the University of Minnesota, Minneapolis, where he also holds the Edgar F. Johnson Professorship. He received the B.Tech., M.S.E.E., and Ph.D. degrees from the Indian Institute of Technology, Kharagpur (India, 1982), the University of Pennsylvania, Philadelphia (1984), and the University of California at Berkeley (1988), respectively. His research interests include all aspects of physical-layer VLSI implementations of broadband access systems. He is currently working on VLSI adaptive digital filters, equalizers and beamformers, error control coders and cryptography architectures, low-power digital systems, and computer arithmetic. He has published over 330 papers in these areas. He has authored the textbook "VLSI Digital Signal Processing Systems" (Wiley, 1999) and coedited the reference book "Digital Signal Processing for Multimedia Systems" (Dekker, 1999). Dr. Parhi has been a Visiting Professor at the Delft University of Technology and at Lund University. He has been a Visiting Researcher at the NEC Corporation, Japan, and a Technical Director, DSP Systems, in the Office of the CTO at Broadcom Corporation in Irvine, Calif. Dr. Parhi has served on the editorial boards of the IEEE Transactions on Circuits and Systems; Circuits and Systems II: Analog and Digital Signal Processing; Signal Processing; VLSI Systems; and the IEEE Signal Processing Letters. He is an editor of the Journal of VLSI Signal Processing. He has received numerous best paper awards, including the 2001 IEEE W. R. G. Baker prize paper award. He has been a Distinguished Lecturer (1994–1999) of, and a recipient of a Golden Jubilee medal (1999) from, the IEEE Circuits and Systems Society. He received a Young Investigator Award from the National Science Foundation in 1992 and was elected a Fellow of IEEE in 1996.


EURASIP Journal on Applied Signal Processing 2002:9, 954–960
© 2002 Hindawi Publishing Corporation

Low-Complexity Versatile Finite Field Multiplier in Normal Basis

Hua Li
Department of Mathematics and Computer Science, University of Lethbridge, Lethbridge, Alberta, Canada T1K 3M4
Email: [email protected]

Chang Nian Zhang
Department of Computer Science, TRLabs, University of Regina, Regina, SK, Canada S4S 0A2
Email: [email protected]

Received 6 August 2001 and in revised form 30 August 2002

A low-complexity VLSI array of a versatile multiplier in normal basis over GF(2^n) is presented. The finite field parameters can be changed according to the user's requirements, which makes the multiplier reusable in different applications. This increases the flexibility to use the same multiplier for different applications and reduces the user's cost. The proposed multiplier has a regular structure and is very suitable for high-speed VLSI implementation. In addition, the pipeline versatile multiplier can be modified into a low-cost architecture which is feasible in embedded systems and restricted computing environments.

Keywords and phrases: finite field multiplication, Massey-Omura multiplier, normal basis, VLSI, encryption.

1. INTRODUCTION

The finite fields GF(2^n) of characteristic 2 are of great interest for cryptosystems and digital signal processing. The addition operation in GF(2^n) is fast and inexpensive, as it can be realized with n bitwise XOR operations. The multiplication operation is costly in terms of gate count and time delay. There have been three main kinds of basis representations of the field elements in GF(2^n): the standard (canonical, polynomial) basis, the dual basis, and the normal basis. Multipliers in different basis representations have their own benefits and trade-offs. The dual basis multiplier [1] needs the least number of gates, which leads to the smallest area required for VLSI implementation [2]. The normal basis multiplier, for example, the Massey-Omura multiplier [3], is very effective in performing squaring, exponentiation, and inversion operations. The standard basis multiplier [4, 5, 6, 7] is easier to extend to high-order finite fields than the dual or normal basis multipliers.

Most of the proposed finite field multipliers operate over a fixed field. In other words, a new multiplier is needed if there is a change in the field parameters, such as the irreducible polynomial defining the representation of the field elements. This makes the multiplier not reusable. There are few versatile multipliers [4, 6, 8, 9] reported, and all are based on the canonical basis. In this paper, we present a new VLSI array of a versatile pipeline multiplier based on the normal basis representation. In the normal basis, squaring is a cost-free cyclic shift operation, and the inversion (the most complicated operation among the important finite field arithmetic operations) can be effectively computed by Fermat's theorem, which requires recursive squaring and multiplication [10, 11]. Three main advantages accrue from the proposed pipelined versatile multiplier. First, the finite field parameters can be changed according to the application environment, which increases the flexibility to use the same multiplier for different applications. Secondly, the structure of the multiplier can easily be extended to higher-order finite fields. Thirdly, the basic architecture of the proposed multiplier can be modified into a low-cost multiplier which is very suitable for both embedded systems and wireless devices with restricted hardware resources. Moreover, the structure of the multiplier has the properties of modularity, simplicity, and regular interconnection, and is easy to implement in VLSI. The proposed versatile multiplier can be used efficiently in public-key cryptosystems, such as elliptic curve cryptography, and in digital signal processing, for example, in Reed-Solomon encoders/decoders.

The outline of the remainder of the paper is as follows. In Section 2, we briefly review the normal basis representation and the Massey-Omura multiplier. Section 3 contains the derivation of the pipeline versatile normal basis multiplier in GF(2^n) and a comparison with previous works. Section 4 concludes with the improved result and a description of areas of application.



2. MULTIPLICATION ON GF(2^n)

It has been proved that there always exists a normal basis [12] for a given finite field GF(2^n), which has the form

$$
N = \left\{\beta, \beta^{2}, \beta^{2^{2}}, \ldots, \beta^{2^{n-1}}\right\},
\tag{1}
$$

where β is a root of the irreducible polynomial P(x) of degree n over GF(2) and the n elements of the set are linearly independent.

We say that β generates the normal basis N, or β is a normal element of GF(2^n). Every element a ∈ GF(2^n) can be represented by $a = \sum_{i=0}^{n-1} a_i \beta^{2^i}$, where $a_i \in \{0, 1\}$.

The following properties [10] of a finite field GF(2^n) are useful in the applications.

(1) Squaring is a linear operation, that is, given any two elements a and b in GF(2^n),

$$
(a + b)^2 = a^2 + b^2.
\tag{2}
$$

(2) For any element a ∈ GF(2^n),

$$
a^{2^n} = a.
\tag{3}
$$

(3) For any element a ∈ GF(2^n),

$$
1 = a + a^2 + a^4 + \cdots + a^{2^{n-1}}.
\tag{4}
$$

This implies that the normal basis representation of 1 is (1, 1, ..., 1).

(4) Squaring an element a in the normal basis representation is a cyclic shift operation, that is,

$$
a^2 = \sum_{i=0}^{n-1} a_i \beta^{2^{i+1}} = \sum_{i=0}^{n-1} a_{i-1} \beta^{2^{i}} = \left(a_{n-1}, a_0, \ldots, a_{n-2}\right)
\tag{5}
$$

with indices reduced modulo n.

Let a and b be two arbitrary elements in GF(2^n) in a normal basis representation and c = a·b be the product of a and b. We denote $a = \sum_{i=0}^{n-1} a_i\beta^{2^i}$ as a vector a = (a_0, a_1, ..., a_{n−1}), $b = \sum_{i=0}^{n-1} b_i\beta^{2^i}$ as a vector b = (b_0, b_1, ..., b_{n−1}), and $c = \sum_{i=0}^{n-1} c_i\beta^{2^i}$ as a vector c = (c_0, c_1, ..., c_{n−1}); then the last term c_{n−1} of c is a logic function of the components of a and b, that is,

$$
c_{n-1} = f\left(a_0, a_1, \ldots, a_{n-1}; b_0, b_1, \ldots, b_{n-1}\right).
\tag{6}
$$
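As a quick illustration of property (4) (our own sketch, not from the paper), squaring in normal basis coordinates is just a one-position cyclic shift, and n successive squarings return the original element, consistent with (3):

    def nb_square(a):
        # Equation (5): squaring sends (a0, ..., a_{n-1}) to
        # (a_{n-1}, a0, ..., a_{n-2}), a cyclic shift.
        return a[-1:] + a[:-1]

    a = [1, 0, 1, 1, 0]          # an element of GF(2^5) in normal basis
    b = a
    for _ in range(5):
        b = nb_square(b)
    assert b == a                # a^(2^5) = a, consistent with (3)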

Since squaring in the normal basis representation is a cyclic shift of the element, we have c² = a² · b², or equivalently

$$
\left(c_{n-1}, c_0, c_1, \ldots, c_{n-2}\right) = \left(a_{n-1}, a_0, a_1, \ldots, a_{n-2}\right) \cdot \left(b_{n-1}, b_0, b_1, \ldots, b_{n-2}\right).
\tag{7}
$$

Hence, the last component c_{n−2} of c² can be obtained by the same function f operating on the components of a² and b². That is,

$$
c_{n-2} = f\left(a_{n-1}, a_0, a_1, \ldots, a_{n-2}; b_{n-1}, b_0, b_1, \ldots, b_{n-2}\right).
\tag{8}
$$

By squaring c repeatedly, we get

$$
\begin{aligned}
c_{n-1} &= f\left(a_0, a_1, \ldots, a_{n-1}; b_0, b_1, \ldots, b_{n-1}\right),\\
c_{n-2} &= f\left(a_{n-1}, a_0, a_1, \ldots, a_{n-2}; b_{n-1}, b_0, b_1, \ldots, b_{n-2}\right),\\
&\;\;\vdots\\
c_0 &= f\left(a_1, a_2, \ldots, a_{n-1}, a_0; b_1, b_2, \ldots, b_{n-1}, b_0\right).
\end{aligned}
\tag{9}
$$

Equations (9) define the Massey-Omura multiplier in normal basis representation [10]. In the Massey-Omura multiplier, the same logic function f used for computing the last component c_{n−1} of the product c can be used to obtain the remaining components c_{n−2}, c_{n−3}, ..., c_0 of the product sequentially. In a parallel architecture, we can use n identical copies of the logic function f to calculate all components of the product simultaneously.
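A compact software rendition of this idea (our own sketch; f is any realization of the logic function in (6)):

    def massey_omura_bits(f, a, b):
        # Equations (9): one logic function f yields every product bit;
        # each step cyclically shifts a and b (i.e., squares them)
        # before reusing f for the next-lower coordinate.
        n, c = len(a), [0] * len(a)
        for k in range(n - 1, -1, -1):
            c[k] = f(a, b)
            a, b = a[-1:] + a[:-1], b[-1:] + b[:-1]
        return c

A parallel realization simply instantiates n copies of f, one per output coordinate.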

3. A PIPELINE ARCHITECTURE FOR THE SERIAL VERSATILE NORMAL BASIS MULTIPLIER

In this section, we derive a pipeline architecture to implement the versatile normal basis multiplier. Let c be the product of a and b,

$$
c = \sum_{i=0}^{n-1}\sum_{j=0}^{n-1} a_i b_j \beta^{2^{i}}\beta^{2^{j}}.
\tag{10}
$$

In the normal basis, we have

$$
\beta^{2^{i}}\beta^{2^{j}} = \sum_{k=0}^{n-1} \lambda^{(k)}_{ij}\beta^{2^{k}}, \qquad \lambda^{(k)}_{ij} \in \mathrm{GF}(2).
\tag{11}
$$

Thus, we can get

$$
c_k = \sum_{i=0}^{n-1}\sum_{j=0}^{n-1} \lambda^{(k)}_{ij} a_i b_j, \qquad 0 \le k \le n-1.
\tag{12}
$$

From the above analysis, we see that the key issue in building a versatile normal basis multiplier is to obtain the values of λ^(k)_ij for different irreducible polynomials. The n × n matrices λ^(k) (0 ≤ k ≤ n − 1), whose elements are λ^(k)_ij (0 ≤ i, j ≤ n − 1), can be obtained if we know the transformation between the elements of the canonical basis and the elements of the normal basis, that is, the normal basis representation of the elements of the canonical basis.

In the following, we define the multiplication table of the normal basis, use the basis element transformation formula to get the values of the multiplication table, and then obtain the n × n matrices λ^(k). Finally, we illustrate the approach to build the versatile pipeline normal basis multiplier.



Definition 1. Let N = {β, β^2, ..., β^{2^{n−1}}} be a normal basis in GF(2^n); then for any i, j (0 ≤ i, j ≤ n − 1), β^{2^i}β^{2^j} is a linear combination of β, β^2, ..., β^{2^{n−1}} with coefficients in GF(2). In particular,

$$
\beta\begin{bmatrix}\beta\\ \beta^{2}\\ \vdots\\ \beta^{2^{n-1}}\end{bmatrix} = T\begin{bmatrix}\beta\\ \beta^{2}\\ \vdots\\ \beta^{2^{n-1}}\end{bmatrix},
\tag{13}
$$

where T is an n × n matrix over GF(2). We call T the multiplication table of the normal basis N. The number of nonzero entries in T is called the complexity of the normal basis N, denoted by C_N.

There always exist the multiplication table T and the matrices λ^(k) for a given irreducible polynomial which defines the normal basis in GF(2^n) [12]. After the multiplication table T is obtained, the matrix λ^(k) can be calculated according to (12). An example is shown below.

Example 1. Let the irreducible polynomial be P1(x) = x^5 + x^4 + x^2 + x + 1 and β be a root of the polynomial; then the canonical basis is {1, β, β^2, β^3, β^4} and the normal basis is {β, β^2, β^4, β^8, β^16}. We can get the following normal basis representation for the elements of the canonical basis:

$$
1 = \beta + \beta^{2} + \beta^{4} + \beta^{8} + \beta^{16}, \qquad \beta = \beta, \qquad \beta^{2} = \beta^{2}, \qquad \beta^{3} = \beta + \beta^{8}, \qquad \beta^{4} = \beta^{4}.
\tag{14}
$$

The appendix illustrates how to obtain the normal basis representation of β^3.

Thus any element β^i (i ≥ 5) can be reduced to its canonical basis representation and converted to the corresponding normal basis representation by the basis element transformation formula (14). For instance,

$$
\beta^{17} = 1 + \beta^{2} + \beta^{3} = 1 + \beta^{2} + \left(\beta + \beta^{8}\right) = \beta^{16} + \beta^{4}.
\tag{15}
$$

Then we can get the multiplication table T for the given P1(x):

$$
T = \begin{bmatrix} 0&1&0&0&0\\ 1&0&0&1&0\\ 0&0&0&1&1\\ 0&1&1&0&0\\ 0&0&1&0&1 \end{bmatrix}, \qquad \beta\begin{bmatrix}\beta\\ \beta^{2}\\ \beta^{4}\\ \beta^{8}\\ \beta^{16}\end{bmatrix} = T\begin{bmatrix}\beta\\ \beta^{2}\\ \beta^{4}\\ \beta^{8}\\ \beta^{16}\end{bmatrix}.
\tag{16}
$$

The product of a and b is

$$
\begin{aligned}
c &= ab\\
&= c_0\beta + c_1\beta^{2} + c_2\beta^{4} + c_3\beta^{8} + c_4\beta^{16}\\
&= \left(a_0\beta + a_1\beta^{2} + a_2\beta^{4} + a_3\beta^{8} + a_4\beta^{16}\right)\times\left(b_0\beta + b_1\beta^{2} + b_2\beta^{4} + b_3\beta^{8} + b_4\beta^{16}\right)\\
&= a_0b_0\beta^{2} + a_0b_1\beta^{3} + a_0b_2\beta^{5} + a_0b_3\beta^{9} + a_0b_4\beta^{17}\\
&\quad + a_1b_0\beta^{3} + a_1b_1\beta^{4} + a_1b_2\beta^{6} + a_1b_3\beta^{10} + a_1b_4\beta^{18}\\
&\quad + a_2b_0\beta^{5} + a_2b_1\beta^{6} + a_2b_2\beta^{8} + a_2b_3\beta^{12} + a_2b_4\beta^{20}\\
&\quad + a_3b_0\beta^{9} + a_3b_1\beta^{10} + a_3b_2\beta^{12} + a_3b_3\beta^{16} + a_3b_4\beta^{24}\\
&\quad + a_4b_0\beta^{17} + a_4b_1\beta^{18} + a_4b_2\beta^{20} + a_4b_3\beta^{24} + a_4b_4\beta^{32}.
\end{aligned}
\tag{17}
$$

As β^6 = (β^3)^2, β^10 = (β^5)^2, β^18 = (β^9)^2, β^12 = (β^6)^2, β^20 = (β^5)^4, β^24 = (β^3)^8, and β^32 = β, we can easily obtain these elements' normal basis representations by cost-free cyclic shift operations on the rows of the multiplication table T, and get the matrix λ^(4), which leads to the function f for computing the coefficient c_4:

$$
\lambda^{(4)} = \begin{bmatrix} 0&0&1&0&1\\ 0&0&1&1&0\\ 1&1&0&0&0\\ 0&1&0&1&0\\ 1&0&0&0&0 \end{bmatrix}.
\tag{18}
$$

It can be readily seen that the matrices λ^(k) (0 ≤ k ≤ n − 1) are symmetric.
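The matrices λ^(k) can also be generated mechanically from the multiplication table: since β^{2^i}β^{2^j} = (β·β^{2^{(j−i) mod n}})^{2^i} and raising to the power 2^i is the cyclic shift of (5), each entry of λ^(k) is a shifted entry of T. A sketch of this observation (the helper name is ours); for the T of (16) it reproduces the λ^(4) of (18):

    def lambda_matrix(T, k):
        # lambda^(k)_{ij} = T[(j - i) mod n][(k - i) mod n]: row (j - i)
        # of T expands beta * beta^(2^(j-i)); the outer power 2^i then
        # cyclically shifts the normal basis coordinates by i positions.
        n = len(T)
        return [[T[(j - i) % n][(k - i) % n] for j in range(n)]
                for i in range(n)]

    T = [[0, 1, 0, 0, 0],        # multiplication table of (16)
         [1, 0, 0, 1, 0],
         [0, 0, 0, 1, 1],
         [0, 1, 1, 0, 0],
         [0, 0, 1, 0, 1]]

    assert lambda_matrix(T, 4) == [[0, 0, 1, 0, 1],
                                   [0, 0, 1, 1, 0],
                                   [1, 1, 0, 0, 0],
                                   [0, 1, 0, 1, 0],
                                   [1, 0, 0, 0, 0]]    # the lambda^(4) of (18)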

From the matrix λ^(4), we can get the following logic function to compute the most significant bit of the product ab in GF(2^5) defined by the irreducible polynomial P1(x):

$$
\begin{aligned}
c_4 &= \sum_{i=0}^{4}\sum_{j=0}^{4}\lambda^{(4)}_{ij} a_i b_j\\
&= a_0b_2 + a_2b_0 + a_0b_4 + a_4b_0 + a_1b_2 + a_2b_1 + a_1b_3 + a_3b_1 + a_3b_3.
\end{aligned}
\tag{19}
$$

In the normal basis representation, the logic function f(a_0, a_1, ..., a_{n−1}; b_0, b_1, ..., b_{n−1}) which is used to get the most significant bit (c_{n−1}) of the product can also be used to get the remaining bits (c_{n−2}, c_{n−3}, ..., c_0) of the product, except that we cyclically shift the inputs of the function [10]. Thus, we may choose one matrix from the matrices λ^(k) (0 ≤ k ≤ n − 1) and input the values of the upper triangle of the symmetric matrix for performing the multiplication.

A VLSI array architecture to implement the versatile GF(2^n) normal basis multiplier is proposed and illustrated in Figures 1 and 2. The basic cells in the structure are 3-input AND gates and 2-input XOR gates.



Figure 1: The logic circuit of the AND gate plane in the versatile multiplier.

Figure 2: The architecture of the serial versatile normal basis GF(2^n) multiplier.

We use the 3-input AND gates to compute a_i b_j λ^(n−1)_ij in the X-Y dimension, and compute the sum of a_i b_j λ^(n−1)_ij by a binary tree structure of 2-input XOR gates in the Z dimension. The architecture requires n^2 3-input AND gates and n^2 − 1 2-input XOR gates; the time delay for generating one bit of the product is T_AND3 + 2⌈log2 n⌉ T_XOR, where T_AND3 is the time delay of a 3-input AND gate and T_XOR is the time delay of a 2-input XOR gate. We can get all bits of the product by cyclically shifting the input coefficients of a and b. As the irreducible polynomial is not changed as frequently as the multiplicands, we can store the elements of the matrix λ^(n−1) in the registers once the irreducible polynomial has been decided.

The algorithm for this multiplication can be described as follows.



Figure 3: A low-cost architecture of the serial versatile normal basis GF(2^n) multiplier.

Algorithm 1 (versatile normal basis multiplication in GF(2^n)).

Input: coefficients of a, b, and the matrix λ^(n−1).
Output: c = ab.
Begin
    load matrix λ^(n−1);
    for k = n − 1 to 0 do
    begin
        c_k = Σ_{i=0}^{n−1} Σ_{j=0}^{n−1} a_i b_j λ^(n−1)_ij;
        cyclically shift the coefficients of a and b;
    end;
End.
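A direct software rendition of Algorithm 1 (our own sketch, not the authors' code; the λ^(4) of (18) serves as λ^(n−1) for n = 5), checked on the trivial product β · β = β²:

    LAM4 = [[0, 0, 1, 0, 1],
            [0, 0, 1, 1, 0],
            [1, 1, 0, 0, 0],
            [0, 1, 0, 1, 0],
            [1, 0, 0, 0, 0]]     # lambda^(4) of (18), i.e., lambda^(n-1) for n = 5

    def versatile_nb_multiply(a, b, lam):
        # One lambda^(n-1) matrix, reused on cyclically shifted inputs
        # for every output bit, as in Algorithm 1.
        n, c = len(a), [0] * len(a)
        for k in range(n - 1, -1, -1):
            c[k] = sum(a[i] & b[j] & lam[i][j]
                       for i in range(n) for j in range(n)) & 1   # GF(2) sum
            a, b = a[-1:] + a[:-1], b[-1:] + b[:-1]               # square a, b
        return c

    # beta * beta = beta^2: coordinates (1,0,0,0,0)^2 -> (0,1,0,0,0)
    assert versatile_nb_multiply([1, 0, 0, 0, 0],
                                 [1, 0, 0, 0, 0], LAM4) == [0, 1, 0, 0, 0]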

The proposed architecture can be implemented in a pipeline structure. In the first n clock cycles, the coefficients of a and b are fed sequentially into the buffers. In the following n clock cycles, we get the result of the product by cyclically shifting the registers which store the original coefficients of a and b. In the meantime, the next two multiplicands can be fed into the buffers during these clock cycles, so we can compute the second product immediately after we finish the first one.

In a restricted computing environment, we can iterate using one level of components of the proposed multiplier (Figure 2) to obtain a low-cost serial architecture, as illustrated in Figure 3, that implements the same computation. It can be described by the following algorithm.

Algorithm 2 (low-cost serial versatile normal basis multiplication in GF(2^n)).

Input: coefficients of a, b, and the matrix λ^(n−1).
Output: c = ab.
Begin
    for k = n − 1 to 0 do
    begin
        c_k^0 = 0;
        for i = 0 to n − 1 do
            c_k^{i+1} = c_k^i + Σ_{j=0}^{n−1} a_i b_j λ^(n−1)_ij;
        cyclically shift the coefficients of a and b;
    end;
End.

The low-cost versatile normal basis multiplier in GF(2^n) requires n 3-input AND gates and n 2-input XOR gates. The time delay for generating one bit of the product is n(T_AND3 + (⌈log2 n⌉ + 1)T_XOR).

The proposed versatile normal basis multipliers have modular structures and regular interconnections, which are suitable for high-speed or restricted-space VLSI implementations. Table 1 lists the comparison of space and time complexity between our new multipliers and previous works. The input ports of the proposed versatile multiplier are almost the same as those of the nonversatile multiplier, since the finite field parameters can be configured into the multiplier through the input ports of the multiplicands (a and b) via a one-bit control signal at configuration time. The finite field parameters do not need reconfiguration during the running time of the multiplier until the application environment changes. Thus the hardware cost can be greatly reduced compared to the nonversatile multiplier, where a new multiplier has to be redesigned and implemented whenever the finite field parameters need to be changed.



Table 1: Comparison of versatile multipliers with nonversatile multipliers in GF(2^n).

Multiplier                              Type           # XOR Gates   # AND Gates     Time Delay
Wang-MOM [10]                           Nonversatile   2n − 2        2n − 1          n(T_AND + (⌈log2 n⌉ + 1)T_XOR)
Li-CVM [9] (canonical basis)            Versatile      2n^2          2n^2            n(T_AND + 2T_XOR)
Prop. multiplier (Figure 2)             Versatile      n^2 − 1       n^2 (3-input)   n(T_AND3 + 2⌈log2 n⌉ T_XOR)
Prop. low-cost multiplier (Figure 3)    Versatile      n             n (3-input)     n^2(T_AND3 + (⌈log2 n⌉ + 1)T_XOR)

Moreover, the proposed architecture in GF(2^n) can easily be expanded to the finite field GF(2^{2n}). One solution is to use two basic GF(2^n) architectures to implement the multiplication in GF(2^{2n}); an alternative solution is to perform the GF(2^{2n}) multiplication serially using only one basic GF(2^n) architecture.

4. CONCLUSION

In this paper, architectures for finite field multiplication based on the normal basis have been proposed. The architectures require simple control signals and have regular local interconnections. As a consequence, they are very suitable for VLSI implementation. The versatile property of this VLSI array modular multiplier increases its application range, and the same multiplier can be applied in different application environments, such as elliptic curve cryptosystems and Reed-Solomon encoders/decoders. The proposed multiplier can easily be extended to higher orders of n for more security. Moreover, the structures can be modified to allow fast exponentiation and inversion. Also note that we can build a low-cost and space-efficient serial multiplier which is feasible in restricted computing environments and embedded systems.

APPENDIX

Let the irreducible polynomial be P1(x) = x^5 + x^4 + x^2 + x + 1 and let β be a root of the polynomial. We show the procedure for computing the multiplication table T and the matrix λ^(4).

As β is a root of P1(x),

$$
\beta^{5} = \beta^{4} + \beta^{2} + \beta + 1,
\tag{A.1}
$$

$$
\beta^{6} = \beta^{5}\beta = \beta^{5} + \beta^{3} + \beta^{2} + \beta = \beta^{4} + \beta^{2} + \beta + 1 + \beta^{3} + \beta^{2} + \beta = \beta^{4} + \beta^{3} + 1.
\tag{A.2}
$$

We multiply both sides of (A.2) by β^2 and get

$$
\beta^{8} = \beta^{6} + \beta^{5} + \beta^{2}.
\tag{A.3}
$$

From (A.3),

$$
\beta^{6} = \beta^{8} + \beta^{5} + \beta^{2}.
\tag{A.4}
$$

Also,

$$
1 = \beta^{16} + \beta^{8} + \beta^{4} + \beta^{2} + \beta.
\tag{A.5}
$$

Substituting (A.5) into (A.1),

$$
\beta^{5} = \beta^{4} + \beta^{2} + \beta + \beta^{16} + \beta^{8} + \beta^{4} + \beta^{2} + \beta = \beta^{16} + \beta^{8}.
\tag{A.6}
$$

Substituting (A.6) into (A.4),

$$
\beta^{6} = \beta^{8} + \beta^{5} + \beta^{2} = \beta^{8} + \beta^{16} + \beta^{8} + \beta^{2} = \beta^{16} + \beta^{2}.
\tag{A.7}
$$

From (A.2), we get

$$
\beta^{3} = \beta^{6} + \beta^{4} + 1.
\tag{A.8}
$$

Substituting (A.7) and (A.5) into (A.8),

$$
\beta^{3} = \beta^{16} + \beta^{2} + \beta^{4} + \beta^{16} + \beta^{8} + \beta^{4} + \beta^{2} + \beta = \beta^{8} + \beta.
\tag{A.9}
$$

REFERENCES

[1] E. R. Berlekamp, "Bit-serial Reed-Solomon encoders," IEEE Transactions on Information Theory, vol. 28, no. 6, pp. 869–874, 1982.

[2] I. S. Hsu, T. K. Truong, L. J. Deutsch, and I. S. Reed, "A comparison of VLSI architecture of finite field multipliers using dual, normal, or standard bases," IEEE Trans. on Computers, vol. 37, no. 6, pp. 735–739, 1988.

[3] J. L. Massey and J. K. Omura, "Computational method and apparatus for finite field arithmetic," U.S. Patent application, 1981.

[4] B. A. Laws Jr. and C. K. Rushforth, "A cellular-array multiplier for GF(2^m)," IEEE Trans. on Computers, vol. 20, no. 12, pp. 1573–1578, 1971.

[5] P. A. Scott, S. E. Tavares, and L. E. Peppard, "A fast VLSI multiplier for GF(2^m)," IEEE Journal on Selected Areas in Communications, vol. 4, pp. 62–66, January 1986.

[6] L. Song and K. Parhi, "Low-energy digit-serial/parallel finite field multipliers," Journal of VLSI Signal Processing, vol. 19, no. 2, pp. 149–166, 1998.

[7] S. K. Jain, L. Song, and K. K. Parhi, "Efficient semisystolic architectures for finite-field arithmetic," IEEE Trans. on VLSI Systems, vol. 6, no. 1, pp. 101–113, 1998.

[8] M. A. Hasan and A. G. Wassal, "VLSI algorithms, architectures and implementation of a versatile GF(2^m) processor," IEEE Trans. on Computers, vol. 49, no. 10, pp. 1064–1073, 2000.



[9] H. Li and C. N. Zhang, "Efficient cellular automata based versatile modular multiplier for GF(2^m)," to appear in Journal of Information Science and Engineering.

[10] C. C. Wang, T. K. Truong, H. M. Shao, L. J. Deutsch, J. K. Omura, and I. S. Reed, "VLSI architectures for computing multiplications and inverses in GF(2^m)," IEEE Trans. on Computers, vol. 34, no. 8, pp. 709–716, 1985.

[11] G. Feng, "A VLSI architecture for fast inversion in GF(2^m)," IEEE Trans. on Computers, vol. 38, no. 10, pp. 1383–1386, 1989.

[12] A. J. Menezes, Applications of Finite Fields, Kluwer Academic Publishers, Boston, Mass, USA, 1993.

Hua Li received his B.E. and M.S. degrees from Beijing Polytechnic University and Peking University, respectively. He is a Ph.D. candidate in the Department of Computer Science, University of Regina. Currently, he works as an assistant professor in the Department of Mathematics and Computer Science, University of Lethbridge, Canada. His research interests include parallel systems, reconfigurable computing, fault tolerance, VLSI design, and information and network security. He is a member of IEEE.

Chang Nian Zhang received his B.S. degree in applied mathematics from the University of Science and Technology, China, and the Ph.D. degree in computer science and engineering from Southern Methodist University. In 1988, he joined Concordia University as a research assistant professor in the Department of Computer Science. Since 1990, he has been with the University of Regina, Canada, in the Department of Computer Science. Currently he is a full professor and leads a research group in parallel processing, data security, and neural networks.


EURASIP Journal on Applied Signal Processing 2002:9, 961–974
© 2002 Hindawi Publishing Corporation

A Reduced-Complexity Fast Algorithm for Software Implementation of the IFFT/FFT in DMT Systems

Tsun-Shan Chan
VXIS Technology Corporation, Hsin-chu, Taiwan, ROC
Email: [email protected]

Jen-Chih Kuo
Department of Electrical Engineering, Graduate Institute of Electronics Engineering, National Taiwan University, Taipei 106, Taiwan, ROC
Email: [email protected]

An-Yeu (Andy) Wu
Department of Electrical Engineering, Graduate Institute of Electronics Engineering, National Taiwan University, Taipei 106, Taiwan, ROC
Email: [email protected]

Received 31 August 2001 and in revised form 15 May 2002

The discrete multitone (DMT) modulation/demodulation scheme is the standard transmission technique in asymmetric digital subscriber lines (ADSL) and very-high-speed digital subscriber lines (VDSL). Although the DMT can achieve a higher data rate than other modulation/demodulation schemes, its computational complexity is too high for cost-efficient implementations. For example, it requires a 512-point IFFT/FFT as the modulation/demodulation kernel in ADSL systems, and an even larger one in VDSL systems. The large block size results in a heavy computational load when running programmable digital signal processors (DSPs). In this paper, we derive a computationally efficient fast algorithm for the IFFT/FFT. The proposed algorithm can avoid the complex-domain operations that are inevitable in conventional IFFT/FFT computation. The resulting software function requires less computational complexity: we show that it requires only 17% of the number of multiplications of the Cooley-Tukey algorithm to compute the IFFT and FFT. Hence, the proposed fast algorithm is very suitable for firmware development in reducing the MIPS count in programmable DSPs.

Keywords and phrases: FFT, IFFT, DMT, software implementation.

1. INTRODUCTION

The recent progress of Internet access has created a strong demand for high-speed data transmission. To overcome the transmission bottleneck over conventional twisted-pair telephone lines, several sophisticated modulation/demodulation schemes have been proposed, including carrierless amplitude-phase (CAP) modulation [1], discrete multitone (DMT) modulation [2, 3, 4, 5], and QAM technology [6]. Among these advanced modulation schemes, the DMT can achieve the highest transmission rate since it incorporates many advanced DSP techniques such as dynamic bit allocation, multidimensional tone encoding, frequency-domain equalization, and so forth. As a consequence, the DMT has been chosen as the physical-layer transmission standard by the ADSL standardization committee.

One major disadvantage of the DMT scheme is its high computational complexity. In particular, the large block size of the IFFT/FFT consumes a great deal of computing power in programmable DSPs [7]. In [8], we considered a cost-efficient lattice VLSI architecture to realize the IFFT/FFT in integrated circuits. In this paper, we propose computationally efficient fast algorithms to run the IFFT/FFT function in software implementations such as programmable DSP processors (DSPs). By making use of the symmetric/antisymmetric properties of the Fourier transform, we first decompose the IFFT/FFT into a combination of two new real-domain transform kernels, the Modified DCT and the Modified DST. These two transform functions are used to replace the complex-domain IFFT/FFT. Then we employ the divide-and-conquer approach in [9] to derive novel recursive algorithms and butterfly architectures for the modified DCT/DST.



Figure 1: The IFFT/FFT block diagram in the DMT system.

The new scheme can avoid the redundant complex-domain operations of the IFFT/FFT; that is, it involves only real-valued operations to compute the IFFT/FFT. Hence, we can avoid the special data structure in software programming to run complex-domain addition/multiplication operations in computing the IFFT/FFT. In addition, our analysis shows that we need only 17% of the multiplications of the Cooley-Tukey algorithm [10] in computing the IFFT and FFT. The low computational complexity as well as the real-domain operations makes it very suitable for firmware coding in DSPs, which helps to save MIPS counts. Also, the DSP program can be written in recursive form, which requires less ROM/RAM program storage space to implement the IFFT/FFT.

The rest of this paper is organized as follows. Section 2 shows the derivation of the IFFT algorithm. In Section 3, the derivation of the FFT algorithm is discussed. The computational complexity comparison is shown in Section 4, where the finite precision effect of our algorithm is also discussed. Finally, we conclude our work in Section 5.

2. REDUCED-COMPLEXITY IFFT ALGORITHM

2.1. The IFFT derivation

The IFFT/FFT block diagram in the DMT system is shown in Figure 1. At the transmitter side, to ensure that the IFFT generates only real-valued outputs, the inputs of the IFFT in the DMT standard have the constraint [11]

$$
X(0) = X(N) = 0, \qquad X(k) = X^{*}(2N - k) \quad \text{for } k = 1, 2, \ldots, N-1,
\tag{1}
$$

where $X(k) \triangleq X_r(k) + j\,X_i(k)$ are encoded complex symbols. As defined in [12, Chapter 9], the IFFT of a finite-length sequence of length 2N is

$$
x(n) = \frac{1}{2N}\left[\sum_{k=0}^{2N-1} X(k)W_{2N}^{-nk}\right], \qquad n = 0, 1, \ldots, 2N-1,
\tag{2}
$$

where

$$
W_{2N}^{nk} \triangleq \exp\left(-j\frac{2\pi nk}{2N}\right) = \cos\frac{2\pi nk}{2N} - j\sin\frac{2\pi nk}{2N}.
\tag{3}
$$

By decomposing the summation into its first half and second half, (2) becomes

$$
x(n) = \frac{1}{2N}\left[\sum_{k=0}^{N-1} X(k)W_{2N}^{-nk} + \sum_{k=N}^{2N-1} X(k)W_{2N}^{-nk}\right].
\tag{4}
$$

Next, by substituting (3) into (4) and using (1), we can simplify (4) as (see Appendix A)

$$
\begin{aligned}
x(n) &= \frac{1}{N}\left[\sum_{k=0}^{N-1} X_r(k)\cos\frac{2\pi nk}{2N} - \sum_{k=0}^{N-1} X_i(k)\sin\frac{2\pi nk}{2N}\right]\\
&= \frac{1}{N}\left[\mathrm{MDCT}(n) - \mathrm{MDST}(n)\right], \qquad n = 0, 1, \ldots, 2N-1.
\end{aligned}
\tag{5}
$$
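As a sanity check of (5), the following sketch (our own illustration, not the authors' code; numpy is assumed) evaluates the MDCT/MDST sums directly and compares the result with a complex IFFT applied to a block extended according to the constraint (1):

    import numpy as np

    def dmt_ifft_real(X):
        # Equation (5): real-domain IFFT from the MDCT and MDST sums.
        # X holds the N encoded symbols X(0..N-1) with X(0) = 0.
        N = len(X)
        n = np.arange(2 * N)[:, None]
        k = np.arange(N)[None, :]
        mdct = (X.real * np.cos(2 * np.pi * n * k / (2 * N))).sum(axis=1)
        mdst = (X.imag * np.sin(2 * np.pi * n * k / (2 * N))).sum(axis=1)
        return (mdct - mdst) / N

    # Compare with the complex IFFT of a block extended per constraint (1).
    N = 8
    X = np.zeros(N, dtype=complex)
    X[1:] = np.random.randn(N - 1) + 1j * np.random.randn(N - 1)
    Xfull = np.zeros(2 * N, dtype=complex)
    Xfull[1:N] = X[1:]
    Xfull[N + 1:] = np.conj(X[1:])[::-1]     # X(k) = X*(2N - k), X(N) = 0
    assert np.allclose(dmt_ifft_real(X), np.fft.ifft(Xfull).real)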



Figure 2: N-point MDCT(n) butterfly structure, where the 1-point MDCT is the minimum-sized processing block.

From (5), we can see that the computation of the IFFT is decomposed into two real-valued operations. One is a discrete cosine transform (DCT)-like operation with Xr(k), k = 0, 1, 2, ..., N − 1, as the inputs. The other is a discrete sine transform (DST)-like operation with Xi(k), k = 0, 1, 2, ..., N − 1, as the inputs. We name the first term the Modified DCT (MDCT) and the second term the Modified DST (MDST). Note that the MDCT and MDST involve only real-valued operators. Furthermore, it can be shown that

$$
\mathrm{MDCT}(n) = \mathrm{MDCT}(2N - n), \qquad n = 0, 1, \ldots, N-1,
\tag{6}
$$

$$
\mathrm{MDST}(n) = -\mathrm{MDST}(2N - n), \qquad n = 0, 1, \ldots, N-1.
\tag{7}
$$

Hence, we can focus on computing MDCT(n) and MDST(n) for n = 0, 1, ..., N − 1, and then expand the results for n = N + 1, N + 2, ..., 2N − 1. For the special cases of n = 0 and n = N, the MDCT and MDST can be simplified as

$$
\begin{aligned}
\mathrm{MDCT}(0) &= \sum_{k=0}^{N-1} X_r(k)\cos\frac{2\pi\cdot 0\cdot k}{2N} = \sum_{k=0}^{N-1} X_r(k),\\
\mathrm{MDST}(0) &= \sum_{k=0}^{N-1} X_i(k)\sin\frac{2\pi\cdot 0\cdot k}{2N} = 0,\\
\mathrm{MDCT}(N) &= \sum_{k=0}^{N-1} X_r(k)\cos\frac{2\pi Nk}{2N} = \sum_{k=0}^{N-1} X_r(k)(-1)^{k},\\
\mathrm{MDST}(N) &= \sum_{k=0}^{N-1} X_i(k)\sin\frac{2\pi Nk}{2N} = 0,
\end{aligned}
\tag{8}
$$

respectively. These simple relationships help us save additional computational complexity.

2.2. MDCT/MDST operations of the IFFT

From the preceding discussion, we can see that the implementation issue of the IFFT is to realize the MDCT and MDST in a cost-efficient way; we can then simply combine the results of the MDCT and MDST to obtain the IFFT results based on (5). Here, we first consider the implementation of the MDCT.



We follow the derivation in [9] and define $C_{2N}^{nk} \triangleq \cos(2\pi nk/2N)$. Then, the MDCT can be written as

$$
\mathrm{MDCT}(n) = \sum_{k=0}^{N-1} X_r(k)C_{2N}^{nk}, \qquad n = 0, 1, \ldots, N-1.
\tag{9}
$$

Decomposing the MDCT into even and odd indices of k, (9) can be rewritten as

$$
\mathrm{MDCT}(n) = g(n) + h'(n), \qquad n = 0, 1, \ldots, \frac{N}{2}-1,
\tag{10}
$$

where

$$
\begin{aligned}
g(n) &\triangleq \sum_{k=0}^{N/2-1} X_r(2k)C_{2N}^{n(2k)} = \sum_{k=0}^{N/2-1} X_r(2k)C_{N}^{nk},\\
h'(n) &\triangleq \sum_{k=0}^{N/2-1} X_r(2k+1)C_{2N}^{n(2k+1)}.
\end{aligned}
\tag{11}
$$

Define $h(n) \triangleq 2C_{2N}^{n}h'(n)$. Following the derivation of Lee's algorithm [9], we can find

$$
\mathrm{MDCT}(n) = g(n) + h'(n) = g(n) + \frac{1}{2C_{2N}^{n}}h(n).
\tag{12}
$$

That is,

$$
\underbrace{\sum_{k=0}^{N-1} X_r(k)C_{2N}^{nk}}_{N\text{-point MDCT}} = \underbrace{\sum_{k=0}^{N/2-1} X_r(2k)C_{N}^{nk}}_{N/2\text{-point MDCT, } g(n)} + \frac{1}{2C_{2N}^{n}}\Bigg[\underbrace{\sum_{k=0}^{N/2-1}\left[X_r(2k+1)+X_r(2k-1)\right]C_{N}^{nk}}_{N/2\text{-point MDCT, } h''(n)} + \underbrace{X_r(N-1)(-1)^{n}}_{\text{injected item}}\Bigg],
$$
$$
n = 0, 1, \ldots, \frac{N}{2}-1.
\tag{13}
$$

On the other hand, by replacing the index n with (N − n) in (12), it can be shown that

$$
\mathrm{MDCT}(N-n) = g(n) - h'(n) = g(n) - \frac{1}{2C_{2N}^{n}}h(n).
\tag{14}
$$

The special case MDCT(N/2) needs to be computed separately and can be simplified as

$$
\mathrm{MDCT}\left(\frac{N}{2}\right) = \sum_{k=0}^{N-1} X_r(k)C_{2N}^{k(N/2)} = \sum_{k=0}^{N-1} X_r(k)\cos\frac{k\pi}{2}.
\tag{15}
$$

The mapping of (13), (14), and (15) is shown in Figure 2. As we can see, the N-point MDCT is decomposed into two N/2-point MDCTs (g(n) and h''(n)) plus some pre-processing and post-processing modules. Then we can apply the divide-and-conquer technique to recursively expand the N/2-point MDCT until the 1-point MDCT is formed. That is, we repeat the decomposition in (10) and (11) until N = 1.
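To make the recursion concrete, a Python sketch (our own illustration, not the authors' implementation; numpy assumed) that follows (13), (14), and (15) down to the 1-point base case and checks the result against the direct definition (9):

    import numpy as np

    def mdct_recursive(x):
        # Divide-and-conquer MDCT following (13), (14), and (15).
        N = len(x)
        if N == 1:
            return np.array([x[0]], dtype=float)
        g = mdct_recursive(x[0::2])                  # N/2-point MDCT, g(n)
        u = x[1::2].astype(float)
        u[1:] += x[1::2][:-1]                        # Xr(2k+1) + Xr(2k-1)
        h2 = mdct_recursive(u)                       # N/2-point MDCT, h''(n)
        n = np.arange(N // 2)
        h = (h2 + x[N - 1] * (-1.0) ** n) / (2 * np.cos(2 * np.pi * n / (2 * N)))
        out = np.empty(N)
        out[:N // 2] = g + h                         # equation (12)
        out[N // 2 + 1:] = (g - h)[1:][::-1]         # equation (14), n -> N - n
        out[N // 2] = (x * np.cos(np.arange(N) * np.pi / 2)).sum()   # (15)
        return out

    x = np.random.randn(8)
    n = np.arange(8)[:, None]
    k = np.arange(8)[None, :]
    assert np.allclose(mdct_recursive(x),
                       (x * np.cos(2 * np.pi * n * k / 16)).sum(axis=1))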

Next, we consider the recursive implementation of the MDST. We define $S_{2N}^{nk} \triangleq \sin(2\pi nk/2N)$. As with the derivation in (10), (11), (12), (13), and (14), we can find

$$
\begin{aligned}
\mathrm{MDST}(n) &= \sum_{k=0}^{N/2-1} X_i(2k)S_{N}^{nk} + \frac{1}{2C_{2N}^{n}}\sum_{k=0}^{N/2-1}\left[X_i(2k+1)+X_i(2k-1)\right]S_{N}^{nk},\\
\mathrm{MDST}(N-n) &= -\sum_{k=0}^{N/2-1} X_i(2k)S_{N}^{nk} + \frac{1}{2C_{2N}^{n}}\sum_{k=0}^{N/2-1}\left[X_i(2k+1)+X_i(2k-1)\right]S_{N}^{nk},
\end{aligned}
$$
$$
n = 0, 1, \ldots, \frac{N}{2}-1.
\tag{16}
$$

It is worth noting that the injected item is zero in the MDST. Besides, the MDST also has a special case for the index N/2:

$$
\mathrm{MDST}\left(\frac{N}{2}\right) = \sum_{k=0}^{N-1} X_i(k)S_{2N}^{k(N/2)} = \sum_{k=0}^{N-1} X_i(k)\sin\frac{k\pi}{2}.
\tag{17}
$$

The mapping of the MDST structure in Figure 3 is similar to the MDCT structure, except that the minimum processing block is the 2-point MDST (see Figure 3) and the injected items do not exist in the MDST implementation. That is, we repeat the decomposition in (16) until N = 2. Note that the 1-point MDST is always equal to zero.

2.3. Overall IFFT computation procedures

The overall IFFT computation flow is shown in Figure 4. It consists of the MDCT/MDST operations and a post-processing operation. The operations in Figure 4 are as follows:

(1) set the butterfly operation to MDCT mode;
(2) X_r(k), k = 0, 1, ..., N - 1, are first fed into the butterfly architecture to obtain MDCT(n), for n = 0, 1, ..., N - 1;
(3) the post-processing operation expands the N-point MDCT outputs to the 2N-point MDCT using the symmetric property in (6);
(4) set the butterfly operation to MDST mode;
(5) repeat the computation in Steps 2 and 3 using X_i(k), k = 0, 1, ..., N - 1, as inputs, and obtain MDST(n), for n = 0, 1, ..., N - 1;
(6) the post-processing operation expands the N-point MDST outputs to the 2N-point MDST by using the antisymmetric property in (7);


(7) based on (5), we combine the MDCT and MDST results together with the scaling operation (which is achieved by shifting right by log2(N) bits) to obtain the IFFT results; this is done in the post-processing operation (a C sketch of the overall flow is given below).

Figure 3: N-point MDST(n) butterfly structure, where the 2-point MDST is the minimum-sized processing block.
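As an illustration of Steps (1)-(7), the following C sketch of ours builds the 2N real IFFT outputs from the N-point kernels, combining them according to (A.5), handling n = N with (8), and expanding the upper half with the symmetry properties (6) and (7). It reuses the includes and the mdct() routine above and assumes an analogous mdst() routine; the division by N corresponds to the right shift by log2(N) bits in a fixed-point realization.

    /* x[0..2N-1] from Xr[0..N-1], Xi[0..N-1] (with X(0) = X(N) = 0). */
    void ifft_dmt(const double *Xr, const double *Xi, double *x, int N)
    {
        double *mc = malloc(sizeof(double) * N);
        double *ms = malloc(sizeof(double) * N);

        mdct(Xr, mc, N);                      /* Steps (1)-(3): MDCT mode */
        mdst(Xi, ms, N);                      /* Steps (4)-(6): MDST mode */

        for (int n = 0; n < N; n++)           /* Step (7), lower half, (A.5) */
            x[n] = (mc[n] - ms[n]) / N;

        x[N] = 0.0;                           /* n = N from (8): MDST(N) = 0 */
        for (int k = 0; k < N; k++)
            x[N] += (k % 2 ? -Xr[k] : Xr[k]) / N;

        for (int m = 1; m < N; m++)           /* upper half via (6) and (7) */
            x[2 * N - m] = (mc[m] + ms[m]) / N;

        free(mc); free(ms);
    }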

2.4. Matrix notation of the MDCT/MDST

In this section, we present the matrix notation of the proposed fast IFFT algorithm. The matrix form helps to reveal the divide-and-conquer nature of our approach. Following the notation in [13], we rewrite X_r(k) and MDCT(n) as

[X_r(k)_N] \triangleq [X_r(0) \ X_r(1) \ \cdots \ X_r(N-1)]^T,  (18)

[MDCT(n)_N] \triangleq [MDCT(0) \ MDCT(1) \ \cdots \ MDCT(N-1)]^T,  (19)

respectively. Then (9) can be represented as

[MDCT(n)_N] = [T_{N,MDCT}][X_r(k)_N],  (20)

where [T_{N,MDCT}] denotes the transform kernel matrix of the MDCT operation. Next, the injected items of (13) can be represented as

Injected = [O_{N/2}] X_r(N - 1),  (21)

where

[O_{N/2}] = [1 \ -1 \ 1 \ -1 \ 1 \ \cdots \ -1]^T.  (22)

We define the odd-summation matrix as

[L_{N/2}] =
\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 & 0 \\
1 & 1 & 0 & \cdots & 0 & 0 \\
0 & 1 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & 1
\end{bmatrix}  (23)

and the scaling matrix as

[\Phi_{N/2}] = \mathrm{diag}\Bigl\{\frac{1}{2C_{2N}^{n}}\Bigr\}, for n = 0, 1, ..., N/2 - 1.  (24)

The special case of the MDCT in (15) can be represented as

MDCT(N/2) = [S_N][X_r(k)_N],  (25)


Figure 4: The proposed IFFT architecture.

where

[S_N] = [1 \ 0 \ -1 \ 0 \ 1 \ \cdots \ 0].  (26)

Based on (12), (13), (14), (21), (22), (23), (24), (25), and (26), [T_{N,MDCT}] can be expressed in matrix form as

[T_{N,MDCT}] =
\begin{bmatrix}
[T_{N/2}] & [\psi_{N,MDCT}] \\
[T_{N/2}][J_{N/2}] & -[\psi_{N,MDCT}][J_{N/2}]
\end{bmatrix},  (27)

where [\psi_{N,MDCT}] = [\Phi_{N/2}]([L_{N/2}][T_{N/2}] + [O_{N/2}]) \| [S_N] and [J_{N/2}] denotes the opposite-diagonal identity matrix. We can also represent (20) and (27) in the recursive form shown in Figure 5. Following the above derivations, the matrix notation of the transform kernel of the MDST can be derived as

[T_{N,MDST}] =
\begin{bmatrix}
[T_{N/2}] & [\psi_{N,MDST}] \\
-[T_{N/2}][J_{N/2}] & [\psi_{N,MDST}][J_{N/2}]
\end{bmatrix},  (28)

where [\psi_{N,MDST}] = [\Phi_{N/2}][L_{N/2}][T_{N/2}][S_N]. Note that the MDST is similar to the MDCT except that there are no injected items. Also, the special case matrix is modified as

[S_N] = [0 \ 1 \ 0 \ -1 \ 0 \ \cdots \ -1].  (29)

The block diagram of the MDST in matrix form is shown in Figure 6.

3. REDUCED-COMPLEXITY FFT ALGORITHM

3.1. The FFT derivation

At the receiver side (see Figure 1), the 512-point FFT is used to demodulate the received signals, which is given by

X(k) = \sum_{n=0}^{2N-1} x(n) W_{2N}^{nk}, for k = 0, 1, ..., 2N - 1,  (30)

where

W_{2N}^{nk} \triangleq \exp\Bigl(-j\frac{2\pi nk}{2N}\Bigr) = \cos\frac{2\pi nk}{2N} - j\sin\frac{2\pi nk}{2N}.  (31)

Note that x(n), n = 0, 1, ..., 2N - 1, are real-valued numbers. Hence, (30) can be rewritten as

X(k) = \sum_{n=0}^{2N-1} x(n)\cos\frac{2\pi nk}{2N} - j\sum_{n=0}^{2N-1} x(n)\sin\frac{2\pi nk}{2N} = MDCT(k) - j\,MDST(k),

for k = 0, 1, ..., 2N - 1.  (32)

Equation (32) shows that the computation of the FFT is decomposed into a combination of two real-domain kernels, MDCT(k) and MDST(k). Both the MDCT and MDST use x(n), n = 0, 1, ..., 2N - 1, as inputs. Hence, we employ only two real-valued kernels (MDCT and MDST), and no complex-valued operations are required in computing the FFT. In addition, in the DMT system, the lower N-point FFT outputs are conjugate-symmetric to the upper N-point outputs. We are only interested in N-point data for k = 0, 1, ..., N - 1; hence, we can neglect the outputs X(k), for k = N, N + 1, ..., 2N - 1.

Figure 5: Block diagram of the MDCT operation in matrix form.

Figure 6: Block diagram of the MDST operation in matrix form.

3.2. MDCT/MDST operations of the FFT

In (32), the transform kernels are the 2N-point MDCT(k) and MDST(k). Here, we propose a novel approach to further reduce the computational complexity, so that we only need to perform the N-point MDCT/MDST.

We first decompose the input sequence into a symmetric sequence, x_c(n), plus an antisymmetric sequence, x_s(n), where

x_c(n) \triangleq \frac{1}{2}\bigl[x(n) + x(2N - n)\bigr],
x_s(n) \triangleq \frac{1}{2}\bigl[x(n) - x(2N - n)\bigr], for n = 1, 2, ..., N - 1.  (33)

Hence, we have

x(n) = x_c(n) + x_s(n),  (34)
x(2N - n) = x_c(n) - x_s(n), for n = 1, 2, ..., N - 1.  (35)

By substituting (34) and (35) into (30), we can simplify (30) as (see Appendix B)

X(k) = x(0) + x(N)(-1)^k + 2\Bigl[\sum_{n=0}^{N-1} x_c(n)\cos\frac{2\pi nk}{2N} - j\sum_{n=0}^{N-1} x_s(n)\sin\frac{2\pi nk}{2N}\Bigr]
     = x(0) + x(N)(-1)^k + 2\bigl[MDCT(k) - j\,MDST(k)\bigr],

for k = 0, 1, ..., N - 1,  (36)

where x_c(0) = 0 and x_s(0) = 0. The block size is thus reduced from 2N points (see (32)) to N points (see (36)).

Next, following the derivations of the IFFT in Section 2, we can have


MDCT(k) = g(k) + \frac{1}{2C_{2N}^{k}} h(k)
        = \underbrace{\sum_{n=0}^{N/2-1} x_c(2n) C_{N}^{nk}}_{N/2\text{-point MDCT, } g(k)}
        + \frac{1}{2C_{2N}^{k}}\Bigl[\underbrace{\sum_{n=0}^{N/2-1}\bigl[x_c(2n+1) + x_c(2n-1)\bigr] C_{N}^{nk}}_{N/2\text{-point MDCT, } h''(k)}
        + \underbrace{x_c(N-1)(-1)^k}_{\text{injected item}}\Bigr],  (37)

MDCT(N - k) = g(k) - \frac{1}{2C_{2N}^{k}} h(k)
            = \sum_{n=0}^{N/2-1} x_c(2n) C_{N}^{nk}
            - \frac{1}{2C_{2N}^{k}}\Bigl[\sum_{n=0}^{N/2-1}\bigl[x_c(2n+1) + x_c(2n-1)\bigr] C_{N}^{nk} + x_c(N-1)(-1)^k\Bigr],

for k = 0, 1, ..., N/2 - 1.  (38)

Similarly, for the MDST(k), we have

MDST(k) = g(k) + \frac{1}{2C_{2N}^{k}} h(k)
        = \sum_{n=0}^{N/2-1} x_s(2n) S_{N}^{nk} + \frac{1}{2C_{2N}^{k}} \sum_{n=0}^{N/2-1}\bigl[x_s(2n+1) + x_s(2n-1)\bigr] S_{N}^{nk},  (39)

MDST(N - k) = -g(k) + \frac{1}{2C_{2N}^{k}} h(k)
            = -\sum_{n=0}^{N/2-1} x_s(2n) S_{N}^{nk} + \frac{1}{2C_{2N}^{k}} \sum_{n=0}^{N/2-1}\bigl[x_s(2n+1) + x_s(2n-1)\bigr] S_{N}^{nk},

for k = 0, 1, ..., N/2 - 1.  (40)

The two special cases for index N/2 are

MDCT(N/2) = \sum_{n=0}^{N-1} x_c(n)\cos\frac{n\pi}{2},
MDST(N/2) = \sum_{n=0}^{N-1} x_s(n)\sin\frac{n\pi}{2}.  (41)

The block diagram of the MDCT(k) is shown in Figure 7. The mapping of the MDST structure is similar to the MDCT structure in Figure 7, except that the minimum processing block is the 2-point MDST and the injected items do not exist in the MDST(k) implementation (see Figure 8). Then we can simply combine the MDCT(k) and MDST(k) outputs, followed by adding x(0) and x(N)(-1)^k, to obtain the FFT results based on (36).

3.3. Overall FFT computation procedures

The overall computation flow of the FFT is shown in Figure 9. The operations are as follows.

(1) The received signals x(n), n = 0, 1, ..., 2N - 1, are decomposed into x_c(n) and x_s(n), n = 0, 1, ..., N - 1, through the pre-processing operation.

(2) In the first phase, the generated x_c(n) are fed into the recursive butterfly operation to obtain the MDCT(k) outputs.

(3) In the second phase, we repeat the computation using the x_s(n) as inputs to the recursive butterfly operation to obtain the MDST(k) outputs.

(4) We combine the MDCT(k) and MDST(k) results and then add x(0) and x(N)(-1)^k to obtain the FFT results based on (36). This is done in the post-processing operation (a software sketch of this flow follows).
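A compact C sketch of this flow (ours; it reuses the mdct()/mdst() kernels sketched in Section 2 and returns the real and imaginary parts of X(k) for k = 0, ..., N - 1) is:

    void fft_dmt(const double *x, double *Xre, double *Xim, int N)
    {
        double *xc = malloc(sizeof(double) * N);
        double *xs = malloc(sizeof(double) * N);
        double *mc = malloc(sizeof(double) * N);
        double *ms = malloc(sizeof(double) * N);

        xc[0] = xs[0] = 0.0;                  /* xc(0) = xs(0) = 0 in (36) */
        for (int n = 1; n < N; n++) {         /* Step (1): pre-processing (33) */
            xc[n] = 0.5 * (x[n] + x[2 * N - n]);
            xs[n] = 0.5 * (x[n] - x[2 * N - n]);
        }
        mdct(xc, mc, N);                      /* Step (2): first phase  */
        mdst(xs, ms, N);                      /* Step (3): second phase */

        for (int k = 0; k < N; k++) {         /* Step (4): post-processing (36) */
            double c = (k % 2 ? -1.0 : 1.0);
            Xre[k] = x[0] + x[N] * c + 2.0 * mc[k];
            Xim[k] = -2.0 * ms[k];
        }
        free(xc); free(xs); free(mc); free(ms);
    }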

3.4. Matrix notation of the MDCT/MDST

Based on (19), (20), (21), (22), (23), (24), (25), and (26), we can represent (37), (38), (39), and (40) as

[T_{N,MDCT}] =
\begin{bmatrix}
[T_{N/2}] & [\psi_{N,MDCT}] \\
[T_{N/2}][J_{N/2}] & -[\psi_{N,MDCT}][J_{N/2}]
\end{bmatrix},  (42)

where [\psi_{N,MDCT}] = [\Phi_{N/2}]([L_{N/2}][T_{N/2}] + [O_{N/2}]) \| [S_N], and

[T_{N,MDST}] =
\begin{bmatrix}
[T_{N/2}] & [\psi_{N,MDST}] \\
-[T_{N/2}][J_{N/2}] & [\psi_{N,MDST}][J_{N/2}]
\end{bmatrix},  (43)

where [\psi_{N,MDST}] = [\Phi_{N/2}][L_{N/2}][T_{N/2}][S_N], for the MDCT(k) and MDST(k), respectively. The block diagrams of the MDCT(k) and MDST(k) are very similar to those of the MDCT(n) and MDST(n) in Section 2; the difference is that a pre-processing step is required to compute x_c(n) and x_s(n). The block diagrams of the MDCT and MDST are shown in Figures 10 and 11, respectively.

Figure 7: N-point MDCT(k) butterfly structure, where the 1-point MDCT is the minimum-sized processing block of the FFT module.

Figure 8: N-point MDST(k) butterfly structure, where the 2-point MDST is the minimum-sized processing block of the FFT module.

Figure 9: The proposed FFT architecture.

Figure 10: Block diagram of the MDCT in matrix form for the FFT operation.

Figure 11: Block diagram of the MDST in matrix form for the FFT operation.

4. COMPLEXITY COMPARISON AND FINITE-PRECISION EFFECT

4.1. Comparison of hardware complexity

In this section, we compare the computational complexity of the proposed algorithm with that of the traditional Cooley-Tukey algorithm. The corresponding butterfly architecture requires log2(2N) stages in the 2N-point IFFT/FFT, and each stage consists of N multiplications and 2N additions. Because the input sequences are complex data, the IFFT/FFT kernels are complex in nature; hence, 1 complex multiplication requires 4 real-valued multiplications and 2 real-valued additions, and 1 complex addition takes 2 real additions. As a result, the direct approach requires a total of 4N log2(2N) real multiplications and 6N log2(2N) real additions. Such a large computational complexity is not suitable for cost-effective realization of the IFFT/FFT modules in the DMT system.

Table 1: Comparison of computational complexity for the 2N-point IFFT/FFT.

(a) Number of multiplication operations.

            IFFT                                              FFT
   N    Cooley-Tukey [10] (O1)  Chan et al. (O2)    CR    Cooley-Tukey [10] (O1)  Chan et al. (O2)    CR
            4N log2 2N         N log2 N - 2N + 2              4N log2 2N         N log2 N - 2N + 2
  256         9216                   1538         0.169         9216                   1538         0.169
  512        20480                   3586         0.175        20480                   3586         0.175
 1024        45056                   8194         0.182        45056                   8194         0.182
 2048        98304                  18434         0.188        98304                  18434         0.188
 4096       212992                  40962         0.192       212992                  40962         0.192
 8192       458752                  90114         0.196       458752                  90114         0.196

(b) Number of addition operations.

            IFFT                                              FFT
   N    Cooley-Tukey [10] (O1)  Chan et al. (O2)    CR    Cooley-Tukey [10] (O1)  Chan et al. (O2)    CR
            6N log2 2N       (9/2)N log2 N + N + 1            6N log2 2N        (9/2)N log2 N + N
  256        13824                   9473         0.685        13824                   9472         0.685
  512        30720                  21249         0.692        30720                  21248         0.692
 1024        67584                  47105         0.697        67584                  47104         0.697
 2048       147456                 103425         0.701       147456                 103424         0.701
 4096       319488                 225281         0.705       319488                 225280         0.705
 8192       688128                 487425         0.708       688128                 487424         0.708

The complexity comparisons for the 2N-point IFFT/FFT are listed in Table 1. The complexity ratio (CR) is defined as

CR \triangleq \frac{O_2}{O_1},  (44)

where O_1 and O_2 are the numbers of multiplications (or additions) in other fast algorithms and in our approach, respectively. We can see that the complexity ratio of the multiplications is only 17% for N = 256 compared with the conventional IFFT/FFT. Table 1 also shows that our approach gains more computational savings as N gets larger, as in the VDSL systems [14].
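The entries of Table 1 can be reproduced directly from these closed forms; for example, the short C program below (ours, for checking only) evaluates the multiplication counts and the corresponding CR.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        for (int N = 256; N <= 8192; N *= 2) {
            double o1 = 4.0 * N * log2(2.0 * N);    /* Cooley-Tukey (O1) */
            double o2 = (double)N * log2((double)N) - 2.0 * N + 2.0; /* (O2) */
            printf("N=%5d  O1=%8.0f  O2=%8.0f  CR=%.3f\n", N, o1, o2, o2 / o1);
        }
        return 0;  /* N = 256 gives 9216, 1538, CR = 0.169, as in Table 1 */
    }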

4.2. Experiment results

There are many DSP processors on the market. Due to the variety of hardware structures, coding styles, compilers, and so forth, we do not attempt detailed optimization for specific processors. Instead, we compare the proposed algorithm with the Cooley-Tukey algorithm, which is a baseline FFT realization. The implementation platform is the TI TMS320C54 evaluation board (http://www.ti.com). Both algorithms are written in the C language without any assembly-level programming tricks. During compilation, the TI C54x C compiler is used without any special compilation options either.

Table 2 shows the comparison of the proposed algorithm and the conventional FFT in terms of clock cycles. As we can see, the proposed algorithm saves roughly 30% of the clock cycles relative to the Cooley-Tukey FFT (for instance, (16,485 - 11,869)/16,485 is about 28% for the 128-point case). This result is consistent with our observation in Table 1, where the dominant addition counts have a CR of about 0.7.


Table 2: Comparison of clock cycles for the Cooley-Tukey FFT and the proposed recursive algorithm.

                        128-point   256-point   512-point
Cooley-Tukey FFT          16,485      37,118      82,347
Proposed                  11,869      25,726      55,435
Clock-cycle saving           28%         31%         33%

Figure 12: Averaged SNR (dB) versus wordlength B (bits) for the 512-point (2N-value) (a) IFFT and (b) FFT, comparing the direct butterfly approach with our approach.

4.3. Finite-precision effect

In fixed-point implementations of the IFFT/FFT kernels, it is important to consider the effects of finite register length on the IFFT/FFT calculations (see [12, Chapter 9] and [15]). To compare the butterfly approach and our approach in fixed-point implementation, we conducted extensive computer simulations using MATLAB for the finite-wordlength IFFT/FFT architecture. Figure 12 shows the SNR performance with assigned wordlengths B = 8, 16, 32 bits. We observe that the SNR performance with B = 16 bits is sufficient for practical fixed-point implementations. From the simulation results, we can see that the SNR performance of our approach is comparable to that of the traditional butterfly approach under the same wordlength.

5. CONCLUSIONS

In this paper, we develop a computationally efficient fast algorithm for the software implementation of the IFFT/FFT kernel in the DMT system. We reformulate the IFFT/FFT functions so as to avoid complex-domain operations. The complexity ratio of the multiplications is only 17% compared with the direct butterfly implementation approach. The proposed algorithm provides a good solution for reducing the MIPS count in programmable DSP implementations for DMT transceiver applications.

APPENDICES

A. DERIVATION OF (4)

Decomposing (4) into the first half and second half, with the fact that X(0) = X(N) = 0, (4) can be represented as

x(n) = \frac{1}{2N}\Bigl[\sum_{k=1}^{N-1} X(k) W_{2N}^{-nk} + \sum_{k=N+1}^{2N-1} X(k) W_{2N}^{-nk}\Bigr].  (A.1)

Using k' = 2N - k to replace the variable in the second term, we have

x(n) = \frac{1}{2N}\Bigl[\sum_{k=1}^{N-1} X(k) W_{2N}^{-nk} + \sum_{k'=N-1}^{1} X(2N - k') W_{2N}^{-(2N-k')n}\Bigr].  (A.2)

Because k' is a dummy variable, we can rewrite (A.2) as

x(n) = \frac{1}{2N}\Bigl[\sum_{k=1}^{N-1} X(k) W_{2N}^{-nk} + \sum_{k=1}^{N-1} X(2N - k) W_{2N}^{-(2N-k)n}\Bigr]
     = \frac{1}{2N}\Bigl[\sum_{k=1}^{N-1} X(k) W_{2N}^{-nk} + \sum_{k=1}^{N-1} X(2N - k) W_{2N}^{-2Nn} W_{2N}^{nk}\Bigr].  (A.3)

By using the facts that

W_{2N}^{2Nn} = 1,
W_{2N}^{nk} = \exp\Bigl(-j\frac{2\pi nk}{2N}\Bigr) = \cos\frac{2\pi nk}{2N} - j\sin\frac{2\pi nk}{2N},
W_{2N}^{-nk} = \exp\Bigl(j\frac{2\pi nk}{2N}\Bigr) = \cos\frac{2\pi nk}{2N} + j\sin\frac{2\pi nk}{2N},
X(0) = X(N) = 0,  (A.4)

we can rearrange (A.3) to

x(n) = \frac{1}{N}\Bigl[\sum_{k=0}^{N-1}\Bigl(X_r(k)\cos\frac{2\pi nk}{2N} - X_i(k)\sin\frac{2\pi nk}{2N}\Bigr)\Bigr].  (A.5)


B. DERIVATION OF (30)

Equation (30) can be represented as

X(k) = x(0) + x(N)(-1)^k + \Bigl[\sum_{n=1}^{N-1} x(n) W_{2N}^{nk} + \sum_{n=N+1}^{2N-1} x(n) W_{2N}^{nk}\Bigr].  (B.1)

Using n' = 2N - n to replace the variable in the second term, we have

X(k) = x(0) + x(N)(-1)^k + \Bigl[\sum_{n=1}^{N-1} x(n) W_{2N}^{nk} + \sum_{n'=N-1}^{1} x(2N - n') W_{2N}^{k(2N-n')}\Bigr].  (B.2)

Because n' is a dummy variable, we can rewrite (B.2) as

X(k) = x(0) + x(N)(-1)^k + \Bigl[\sum_{n=1}^{N-1} x(n) W_{2N}^{nk} + \sum_{n=1}^{N-1} x(2N - n) W_{2N}^{k(2N-n)}\Bigr]
     = x(0) + x(N)(-1)^k + \Bigl[\sum_{n=1}^{N-1} x(n) W_{2N}^{nk} + \sum_{n=1}^{N-1} x(2N - n) W_{2N}^{2kN} W_{2N}^{-nk}\Bigr].  (B.3)

By using the fact that W_{2N}^{2kN} = 1 and applying the input-data decomposition in (34) and (35), we can rearrange (B.3) as

X(k) = x(0) + x(N)(-1)^k + 2\Bigl[\sum_{n=1}^{N-1} x_c(n)\cos\frac{2\pi nk}{2N} - j\sum_{n=1}^{N-1} x_s(n)\sin\frac{2\pi nk}{2N}\Bigr].  (B.4)

ACKNOWLEDGMENT

T. S. Chan is with VXIS Technology Corp., Hsin-Chu, Taiwan, ROC. This work was supported in part by the National Science Council, ROC, under Grant NSC 87-2213-E-008-011.

REFERENCES

[1] G. H. Im, D. D. Harman, G. Huang, A. V. Mandzik, M. H. Nguyen, and J. J. Werner, "51.84 Mb/s 16-CAP ATM LAN standard," IEEE Journal on Selected Areas in Communications, vol. 13, no. 4, pp. 620-632, 1995.

[2] J. S. Chow, J. C. Tu, and J. M. Cioffi, "A discrete multitone transceiver system for HDSL applications," IEEE Journal on Selected Areas in Communications, vol. 9, no. 6, pp. 895-908, 1991.

[3] K. Sistanizadeh, P. Chow, and J. M. Cioffi, "Multi-tone transmission for asymmetric digital subscriber lines (ADSL)," in Proc. IEEE International Conf. on Communications, vol. 2, pp. 756-760, Geneva, Switzerland, 1993.

[4] I. Lee, J. S. Chow, and J. M. Cioffi, "Performance evaluation of a fast computation algorithm for the DMT in high-speed subscriber loop," IEEE Journal on Selected Areas in Communications, vol. 13, no. 9, pp. 1560-1570, 1995.

[5] T. N. Zogakis, J. T. Aslanis Jr., and J. M. Cioffi, "A coded and shaped discrete multitone system," IEEE Trans. Communications, vol. 43, no. 12, pp. 2941-2949, 1995.

[6] B. Daneshrad and H. Samueli, "A 1.6 Mbps digital-QAM system for DSL transmission," IEEE Journal on Selected Areas in Communications, vol. 13, no. 9, pp. 1600-1610, 1995.

[7] B. R. Wiese and J. S. Chow, "Programmable implementations of xDSL transceiver systems," IEEE Communications Magazine, vol. 38, no. 5, pp. 114-119, 2000.

[8] A.-Y. Wu and T. S. Chan, "Cost-efficient parallel lattice VLSI architecture for the IFFT/FFT in DMT transceiver technology," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, pp. 3517-3520, Seattle, Wash, USA, May 1998.

[9] B. G. Lee, "A new algorithm to compute the discrete cosine transform," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1243-1245, 1984.

[10] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comp., vol. 19, pp. 297-301, April 1965.

[11] ANSI Standard T1.413, "Network and customer installation interface-Asymmetric digital subscriber line (ADSL) metallic interface," 1995.

[12] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, USA, 1989.

[13] H. D. Yun and S. U. Lee, "On the fixed-point-error analysis of several fast DCT algorithms," IEEE Trans. Circuits and Systems for Video Technology, vol. 3, no. 1, pp. 27-41, 1993.

[14] T1E1.4/2000-013R3, "Very-high-speed digital subscriber lines (VDSL) metallic interface, part 3: Technical specification of a multi-carrier modulation transceiver," 2000.

[15] K. J. R. Liu, A.-Y. Wu, A. Raghupathy, and J. Chen, "Algorithm-based low-power and high-performance multimedia signal processing," Proceedings of the IEEE, vol. 86, no. 6, pp. 1155-1202, 1998, Special Issue on Multimedia Signal Processing.

Tsun-Shan Chan was born in Chang-Hui, Taiwan, ROC, in 1973. He received his M.S. degree in electrical engineering from the National Central University, Taiwan, in 1998. During 1998-1999, he worked on communication applications at the Industrial Technology Research Institute, Hsin-Chu, Taiwan. Since 1999, he has been serving as a system engineer on video processing projects at VXIS Technology Corporation.

Jen-Chih Kuo received his B.S. degree in electrical engineering from the National Taiwan University, Taiwan, in 2000. He is now with the Graduate Institute of Electronics Engineering of the same school. His research interests include VLSI architectures for DSP algorithms, adaptive signal processing, and digital communication systems.


An-Yeu (Andy) Wu received his B.S. degree from National Taiwan University in 1987, and the M.S. and Ph.D. degrees from the University of Maryland, College Park, in 1992 and 1995, respectively, all in electrical engineering. During 1987-1989, he served as a signal officer in the Army, Taipei, Taiwan, for his mandatory military service. During 1990-1995, he was a graduate teaching and research assistant with the Department of Electrical Engineering and Institute for Systems Research at the University of Maryland, College Park. From August 1995 to July 1996, he was a Member of Technical Staff at AT&T Bell Laboratories, Murray Hill, NJ, working on high-speed transmission IC designs. From 1996 to July 2000, he was with the Electrical Engineering Department of National Central University, Taiwan. He is currently an Associate Professor with the Department of Electrical Engineering and Graduate Institute of Electronics Engineering of National Taiwan University, Taiwan. His research interests include low-power/high-performance VLSI architectures for DSP and communication applications, adaptive signal processing, and multirate signal processing.


EURASIP Journal on Applied Signal Processing 2002:9, 975-980
© 2002 Hindawi Publishing Corporation

A DSP Based POD Implementation for High Speed Multimedia Communications

Chang Nian Zhang
Department of Computer Science, University of Regina, TRLabs, SK, Canada S4S 0A2
Email: [email protected]

Hua Li
Department of Mathematics and Computer Science, University of Lethbridge, Lethbridge, Alberta, Canada T1K 3M4
Email: [email protected]

Nuannuan Zhang

Department of Computer Science, University of Regina, TRLabs, SK, Canada S4S 0A2

Jiesheng Xie

Department of Computer, China Agriculture University, Beijing 100083, China

Received 20 August 2001 and in revised form 10 August 2002

In cable network services, the audio/video entertainment contents should be protected from unauthorized copying, intercepting, and tampering. The point-of-deployment (POD) security module, proposed by OpenCable™, allows viewers to receive secure cable services such as premium subscription channels, impulse pay-per-view, and video-on-demand, as well as other interactive services. In this paper, we present a digital signal processor (DSP) (TMS320C6211) based POD implementation for real-time applications, which includes the elliptic curve digital signature algorithm (ECDSA), elliptic curve Diffie-Hellman (ECDH) key exchange, the elliptic curve key derivation function (ECKDF), cellular automata (CA) cryptography, the communication processes between POD and Host, and Host authentication. In order to obtain different security levels and different rates of encryption/decryption, a CA based symmetric key cryptography algorithm is used whose encryption/decryption rate can be up to 75 Mbps. The experimental results indicate that the DSP based POD implementation provides high speed and flexibility, and satisfies the requirements of real-time video data transmission.

Keywords and phrases: point-of-deployment, DSP, cellular automata, copy protection, ECDSA, DH key exchange.

1. INTRODUCTION

The next generation of cable networks requires that a security module be built separately from the host devices (set top boxes and integrated digital televisions) in order to facilitate the commercial sale of navigational devices. The point-of-deployment (POD) security module is being developed to satisfy these separable security requirements and to enable retail availability of Host devices [1, 2, 3].

The POD module supports two major functions.

(1) The POD will provide the cable operator with a secure device at the customer's location.

(2) The POD will act as a translator so that the Host device will only have to understand a single protocol, regardless of the type of network to which it is connected.

Since the draft of the specification of the POD module was released in fall 1997, several POD products have been reported. All of them are application specific integrated circuits (ASICs), and use the data encryption standard (DES) as the preliminary technique for content encryption/decryption. But DES has been shown not to be secure enough and will be replaced by new standards. Moreover, due to the nature of cable network services, different applications require different security levels; it is desirable for the POD to provide versatile cryptography schemes. On the other hand, since the current specification of the POD module has not been accepted as an international standard, any further modification of the standard will cause redesigning and rebuilding of an ASIC POD.

In order to provide a low-cost and flexible POD, a DSP based POD implementation is proposed in this paper, which satisfies the requirements of real-time video data transmission and can be applied at different security levels. The outline of the remainder of the paper is as follows. Section 2 introduces the POD security module, including an overview of the POD, its functionalities, and the algorithms used in the POD. Section 3 presents the POD implementation based on the DSP. Section 4 concludes the paper.

Figure 1: OpenCable network and consumer interfaces.

2. FUNCTIONS OF POD MODULE

The set top box (STB) is a commonly used interface between the digital television and the functions accessible via the cable network in the architecture of next-generation television and video systems. It attaches a point-of-deployment (POD) security plug-in module to provide the security and copy protection of the contents.

Figure 1 illustrates logically how the POD module interface connects with other OpenCable interfaces. In Figure 1, OCI-N (OpenCable interface network) is the interface between a cable network and the Host device. OCI-C1 (OpenCable interface consumer 1) is the interface between a Host device and a digital consumer device. OCI-C2 (OpenCable interface consumer 2) is the interface between a Host device and the POD module.

The primary functions of the OpenCable POD module include: (1) providing conditional access to a Host device; (2) providing communication and control between the headend and the Host device. The POD module decrypts the contents under control of the headend and re-encrypts the contents for the purpose of copy protection between the POD module and the Host device. Typically, the POD is authorized by the conditional access system to decrypt contents, and authorizes the Host by delivering either clear or CP (copy protection) encrypted content. The content passing the POD interface can be in one of the following three formats: (1) cleartext, (2) passing through, (3) rescrambled. The copy protection between the POD and the Host works as follows.

Step 1 (Initialization of the POD and the Host evaluation). When the POD is powered on, it checks whether the Host supports OpenCable™ content protection by checking the availability of the CP resource and verifying the authenticity of the device certificate.

Step 2 (Host authentication). The POD retrieves the Host certificate data to initiate the authentication procedure, and the Host replies to it. After this exchange, both the POD and the Host come up with the authentication key.

Step 3 (Key exchange). The POD sends its DH (Diffie-Hellman) public key to the Host and requests the Host's DH public key, and then the Host sends its DH public key to the POD. After this exchange, both the POD and the Host come up with a common secret value. By using a method covered by intellectual property, they establish the shared secret keys derived from the Host authentication process.

Step 4 (Interface encryption). The POD uses the secret key to encrypt the content.

The cryptography schemes used in POD include:

(1) Elliptic curve digital signature algorithm (ECDSA), which is used in the Host authentication process for signing and verification.

(2) Diffie-Hellman (DH) public key agreement algorithm, which provides a method for the POD and Host to compute a shared secret value that is used in the content encryption/decryption key generation.

(3) SHA-1 (secure hash algorithm) [4], which is used in the digital signature algorithm to generate a message digest of length 160 bits. For the POD, the SHA-1 algorithm is used for Host certificate signature verification, authentication key generation, and copy protection key generation.

(4) Elliptic curve key derivation function (ECKDF) algorithm, which is used to generate the key for the content protection.

Moreover, a random number generator is included to generate DH private keys, which will be compliant with the SHA-1 based algorithm. Each OpenCable device has a unique seed value which is set by the manufacturer.

Figure 2 illustrates the cryptographic functions used in the POD copy protection.

3. A DSP BASED POD IMPLEMENTATION

3.1. Introduction of DSP C6211

The Texas Instruments (TI) TMS320C6000 generation [5] is based on the VelociTI™ architecture, an advanced architecture for DSPs with very long instruction words (VLIW). The VLIW architecture makes it very suitable for multichannel and multifunction applications. The TMS320C6211 (C6211 for short) provides 1200 MIPS (million instructions per second) at 150 MHz, and the TMS320C62xx devices are the fixed-point DSP family. The cache architecture in the C6211 provides low cost and high performance capabilities.

Figure 2: Cryptographic functions used in POD copy protection (Diffie-Hellman key exchange, SHA-1, and the key derivation function on each side feed the DES/CA encryption of the MPEG transport stream).

The C6211 has 32 general-purpose registers of 32-bit word length and eight highly independent functional units. The eight functional units provide six arithmetic logic units (ALUs) for a high degree of parallelism and two 16-bit multipliers. The development tools for the C6211 include a C compiler, an assembly optimizer to simplify programming and scheduling, and a Windows™ debugger interface for visibility into source code execution [6]. A DSP based POD can greatly reduce the hardware design period, since it can easily be reprogrammed when the specifications of the POD are modified or new components are added.

3.2. Cryptography algorithms used in the DSP based POD

In order to make the POD more efficient, we use the ECKDF, which is based on elliptic curve cryptography [7, 8], for the key derivation function, and a cellular automata (CA) based symmetric-key cryptographic algorithm for media content protection.

The ECDSA algorithm is applied in the POD to authenticate the Host; it includes three parts: the key schedule, which sets up the key, the signature procedure, and the verification process, as illustrated in Figure 3.

The elliptic curve Diffie-Hellman (ECDH) primitive is the basis for the operation of the elliptic curve encryption scheme. For the POD, we use this algorithm to exchange the key between the POD and the Host. Figure 4 illustrates the flow chart of the ECDH algorithm.

Figure 3: ECDSA algorithm (key generation Q = dP, signature generation (r, s), and a verification procedure that checks whether r = v).

Figure 4: ECDH protocol between POD and Host (from shared elliptic curve parameters, each side combines its private key dP or dH with the other side's public key QH or QP to reach the same shared value zP = zH).

Suppose the POD (P) and the Host (H) will communicate with each other and require the key exchange. Here we use dP and QP to represent P's private key and public key, which are obtained from the key schedule; dH and QH denote H's private key and public key, respectively. P performs the following steps:

    /* set up the scheme */
    create the elliptic curve;
    /* compute the elliptic curve point */
    VP = (xP, yP) = dP QH;
    return the x component of VP as the shared secret key (zP).

Similarly, H uses the same primitive to get the shared secret key:

    /* set up the scheme */
    create the elliptic curve;
    /* compute the elliptic curve point */
    VH = (xH, yH) = dH QP;
    return the x component of VH as the shared secret key (zH).

By running the ECDH algorithm, we have zP = zH; that is, the two parties get the same secret key.
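To make the exchange tangible, the following self-contained C program of ours runs the same primitive on a textbook toy curve, y^2 = x^3 + x + 6 over GF(11) with base point G = (2, 7); the curve, the key values, and all names are illustrative only, and a real POD would use a standardized curve [7].

    #include <stdio.h>

    typedef struct { long x, y; int inf; } pt;
    static const long P = 11, A = 1;          /* y^2 = x^3 + Ax + 6 mod P */

    static long mulm(long a, long b) { return ((a % P) * (b % P) % P + P) % P; }
    static long inv(long a) {                 /* Fermat: a^(P-2) mod P */
        long r = 1, e = P - 2; a = (a % P + P) % P;
        while (e) { if (e & 1) r = mulm(r, a); a = mulm(a, a); e >>= 1; }
        return r;
    }
    static pt add(pt p, pt q) {               /* affine point addition */
        if (p.inf) return q;
        if (q.inf) return p;
        long lam;
        if (p.x == q.x) {
            if ((p.y + q.y) % P == 0) { pt o = {0, 0, 1}; return o; }
            lam = mulm(mulm(3, mulm(p.x, p.x)) + A, inv(mulm(2, p.y)));
        } else
            lam = mulm(q.y - p.y + P, inv((q.x - p.x + P) % P));
        pt r; r.inf = 0;
        r.x = ((mulm(lam, lam) - p.x - q.x) % P + P) % P;
        r.y = ((mulm(lam, p.x - r.x + P) - p.y) % P + P) % P;
        return r;
    }
    static pt mul(long d, pt g) {             /* double-and-add */
        pt r = {0, 0, 1};
        while (d) { if (d & 1) r = add(r, g); g = add(g, g); d >>= 1; }
        return r;
    }

    int main(void) {
        pt G = {2, 7, 0};
        long dP = 3, dH = 5;                   /* private keys (toy values) */
        pt QP = mul(dP, G), QH = mul(dH, G);   /* public keys  */
        pt VP = mul(dP, QH), VH = mul(dH, QP); /* shared point */
        printf("zP = %ld, zH = %ld\n", VP.x, VH.x);  /* both print 5 */
        return 0;
    }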

In the POD implementation, the ECKDF key derivation function is used to generate the common key for content encryption and decryption. The following is a description of the ECKDF key derivation function:

    check the length of the input data (z);
    initialize Counter = 1;
    for (i = 0; i < n; i++) {
        /* compute the hash value */
        ki = h(z | Counter);
        increment Counter;
    }
    set the key K = k1 | k2 | ... | kn,

where "|" means concatenation and h stands for the hash function SHA-1. By applying this function, we can generate different key sizes as required.
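A C sketch of this loop is given below. The sha1() routine, its (data, length, 20-byte digest) interface, and the big-endian encoding of Counter are assumptions of ours rather than details fixed by the POD specification.

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    void sha1(const uint8_t *msg, size_t len, uint8_t digest[20]); /* external */

    /* Derive n * 20 bytes of key material K from the shared value z. */
    void eckdf(const uint8_t *z, size_t zlen, int n, uint8_t *K)
    {
        uint8_t buf[256];
        uint32_t counter = 1;
        assert(zlen + 4 <= sizeof(buf));      /* check the length of z */
        for (int i = 0; i < n; i++) {
            memcpy(buf, z, zlen);             /* z | Counter */
            buf[zlen]     = (uint8_t)(counter >> 24);
            buf[zlen + 1] = (uint8_t)(counter >> 16);
            buf[zlen + 2] = (uint8_t)(counter >> 8);
            buf[zlen + 3] = (uint8_t)counter;
            sha1(buf, zlen + 4, K + 20 * i);  /* k_i = h(z | Counter) */
            counter++;                        /* increment Counter */
        }
    }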

In the following, we introduce the cellular automata based symmetric-key cryptography algorithm and how it is applied in the POD. A cellular automaton (CA) is an array of cells where each cell is in one of the permissible states. For example, in a 2-state CA, each cell's state can be zero or one. In a k-neighborhood CA, at each clock cycle, the evolution of a cell value depends on its rule and the present states of its neighbors. The following three complemented CA rules have special characteristics which can be applied in message encryption:

Rule 51:  x_i(t) = \overline{x_i(t-1)},
Rule 195: x_i(t) = \overline{x_{i-1}(t-1) \oplus x_i(t-1)},
Rule 153: x_i(t) = \overline{x_i(t-1) \oplus x_{i+1}(t-1)}.  (1)

Theorem 1. Applying the complemented rules 195, 153, and 51 to a CA forms a CA group [9].

Theorem 2. If a CA is configured with rules 51, 153, and 195, then its state transition diagram consists of equal cycles of even length.

Thus, if we choose rules 51, 153, and 195 as a group CA, then the fundamental transformations are self-inverse; that is, the decryption is carried out in the same way as the encryption. Assuming the rule matrix is T, we have

T^{2n} = T^n \cdot T^n = I (the identity matrix).  (2)

Figure 5: Overview of the rule applied to the message; each position's control bit selects between rule 51 and a different rule.

The CA-based block cipher scheme is as follows.

Encryption:

E = T_1^{n_1} T_2^{n_2} \cdots T_q^{n_q}, \qquad C = EM.  (3)

Decryption:

M = E^{-1} C
  = \bigl(T_1^{n_1} T_2^{n_2} \cdots T_q^{n_q}\bigr)^{-1} C
  = \Bigl(\bigl(T_q^{n_q}\bigr)^{-1} \cdots \bigl(T_2^{n_2}\bigr)^{-1}\bigl(T_1^{n_1}\bigr)^{-1}\Bigr) C
  = \bigl(T_q^{n_q} \cdots T_2^{n_2} T_1^{n_1}\bigr) C,  (4)

where T_1, T_2, ..., T_q are secret CA rules, which can be viewed as the subkeys of the block cipher. The flexibility of the CA based cryptosystem is that, by choosing different values of n and q, we can achieve different security levels and data encryption/decryption rates according to the application requirements.

In Figure 5, the first bit is the rule control bit, where "0" stands for rule 51 and "1" stands for rule 195 or 153, selected by the corresponding bit. The core procedure of the CA algorithm is described as follows:

    temp51 = (~Message) & (~Rule);       /* implement rule 51 */
    switch (rule_sign) {
    case 0:
        /* rule 195: XNOR with the left neighbor; the shift direction
           depends on the bit-ordering convention */
        temp1 = Message << 1;
        temp195 = (~(Message ^ temp1)) & Rule;
        temp_C_Block = temp195;
        break;
    case 1:
        /* rule 153: XNOR with the right neighbor */
        temp2 = Message >> 1;
        temp153 = (~(Message ^ temp2)) & Rule;
        temp_C_Block = temp153;
        break;
    }
    C_Block = temp51 | temp_C_Block;

Note that the cycles used for encryption and decryption can be variable as well in CA based cryptography. For example, if we set 2n = 8, that is, the message should be processed by applying the CA rule 8 times during the combined procedure of encryption and decryption, then we can choose the first four cycles for encryption and the other four cycles for decryption, or we can use the first three cycles for encryption and the other five cycles for decryption.
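The self-inverse behavior of (2) can be checked directly. The following C program of ours iterates a one-word CA whose per-cell rules are selected by a rule mask as in Figure 5, finds the cycle length empirically (Theorem 2 guarantees it is even for this rule set), and splits the cycle between encryption and decryption; the constants are arbitrary test values, and the bit ordering of the neighbors is an assumption.

    #include <stdint.h>
    #include <stdio.h>

    /* One CA step: cells with rule-mask bit 0 use rule 51 (complement);
       cells with bit 1 use rule 195 (rule_sign = 0) or 153 (rule_sign = 1). */
    static uint32_t ca_step(uint32_t s, uint32_t rule, int rule_sign)
    {
        uint32_t r51 = (~s) & ~rule;
        uint32_t nbr = rule_sign ? s >> 1 : s << 1;   /* neighbor cell */
        return r51 | ((~(s ^ nbr)) & rule);           /* XNOR: rule 195/153 */
    }

    int main(void)
    {
        uint32_t m = 0xC0FFEE11u, rule = 0xA5A5A5A5u, s, c, d;
        int period = 1, i;

        s = ca_step(m, rule, 0);               /* find the cycle length */
        while (s != m) { s = ca_step(s, rule, 0); period++; }

        c = m;                                 /* encrypt: half the cycle */
        for (i = 0; i < period / 2; i++) c = ca_step(c, rule, 0);
        d = c;                                 /* decrypt: remaining steps */
        for (i = 0; i < period - period / 2; i++) d = ca_step(d, rule, 0);

        printf("period=%d plain=%08X cipher=%08X recovered=%08X\n",
               period, (unsigned)m, (unsigned)c, (unsigned)d); /* d == m */
        return 0;
    }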


Table 1: Test data for the encryption speed for CA (cycle < 5).

Test   No_rule   No_cycle   No_clk   En_speed (Mbps)
  1       1          1        163        117.791
  2       1          2        318         60.377
  3       1          3        430         44.651
  4       1          4        542         35.424
  5       2          1        318         60.377
  6       2          2        623         30.819
  7       2          3        847         22.668
  8       2          4       1071         17.927
  9       3          1        430         44.651
 10       3          2        887         21.646
 11       3          3       1223         15.699
 12       3          4       1559         12.316
 13       4          1        542         35.424
 14       4          2       1151         16.681
 15       4          3       1599         12.008
 16       4          4       2047          9.380
 17       5          1        654         29.358
 18       5          2       1415         13.569
 19       5          3       1975          9.722
 20       5          4       2535          7.574

The CA based cryptography is used for video content protection in the POD implementation. The experiments indicate that the encryption/decryption rate can be up to 75 Mbps, which satisfies the requirement of real-time data transmission in the cable network.

3.3. Implementation

The algorithms used in the POD are programmed in the C language and compiled with the C6211 development tools, where Code Composer Studio compiles and converts the C programs into assembly language. Finally, an executable file in .out format is produced and loaded into the DSP.

Tables 1 and 2 list all the data tested for different rules and cycles. In these tables, No_rule stands for the number of rules, No_cycle is the number of processing cycles for encryption, No_clk is the number of DSP clock cycles taken by the program, and En_speed is the encryption speed, which can be calculated by the following equation:

En_speed = 150 × 128 / No_clk (Mbps),  (5)

since the DSP runs at 150 MHz and each run processes a 128-bit block. For example, the first row of Table 1 gives 150 × 128/163 ≈ 117.8 Mbps.

4. CONCLUSION

The POD is the security module to be used in cable network and digital TV services. Its main function is to provide the cryptographic protocol for the interface between the POD and the Host, and to protect the content passing through that interface. In this paper, a DSP based POD implementation has been proposed using the TMS320C6211. The experiments indicate that the proposed POD implementation provides high data speed and flexibility for real-time applications. In order to obtain different degrees of security and different speeds of encryption/decryption, we use a simple symmetric key encryption

Table 2: Test data for the encryption speed for CA (cycle ≥ 5).

Test   No_rule   No_cycle   No_clk   En_speed (Mbps)
  1       1          5        654         29.358
  2       1          6        766         25.065
  3       1          7        878         21.868
  4       2          5       1295         14.826
  5       2          6       1519         12.640
  6       2          7       1743         11.015
  7       3          5       1895         10.132
  8       3          6       2231          8.606
  9       3          7       2567          7.480
 10       4          5       2495          7.695
 11       4          6       2943          6.524
 12       4          7       3391          5.662
 13       5          5       3095          6.204
 14       5          6       3655          5.253
 15       5          7       4215          4.555

algorithm, cellular automata cryptography, for the content protection, whose encryption/decryption rate can be up to 75 Mbps.

REFERENCES

[1] OpenCable™, "OpenCable™ POD Copy Protection System," IS-POD-CP-INT01-000107, January 2000.

[2] OpenCable™, "OpenCable™ Host-POD Interface Specification," IS-POD-131-INT01-991027, October 1999.

[3] Hitachi, Intel, MEI, Sony, and Toshiba companies, Digital Transmission Content Protection Specification (Informational Version), Revision 1.0, vol. 1, April 1999.

[4] National Institute of Standards and Technology (NIST), Secure Hash Standard (SHS), FIPS Publication 180-1, April 1995.

[5] Texas Instruments, "How to Begin Development Today with the TMS320C6211 DSP," Application report, SPRA474, September 1998.

[6] Texas Instruments, "TMS320C6000 Optimizing C Compiler User's Guide," Digital signal processing solutions, 1999.

[7] Certicom Research, "Standards for Efficient Cryptography, SEC 1: Elliptic Curve Cryptography," Working Draft, Version 0.5, Certicom Corp., 1999.

[8] M. Rosing, Implementing Elliptic Curve Cryptography, Manning Publications, Greenwich, Conn, USA, 1999.

[9] S. Nandi, B. K. Kar, and P. Pal Chaudhuri, "Theory and applications of cellular automata in cryptography," IEEE Trans. on Computers, vol. 43, no. 12, pp. 1346-1357, 1994.

Chang Nian Zhang received his B.S. degree in applied mathematics from the University of Science and Technology of China, and the Ph.D. degree in computer science and engineering from Southern Methodist University. In 1988, he joined Concordia University as a research assistant professor in the Department of Computer Science. Since 1990, he has been with the University of Regina, Canada, in the Department of Computer Science. Currently he is a full professor and leads a research group in parallel processing, data security, and neural networks.


Hua Li received his B.E. and M.S. degrees from Beijing Polytechnic University and Peking University. He is a Ph.D. candidate in the Department of Computer Science, University of Regina. Currently, he works as an assistant professor in the Department of Mathematics and Computer Science, University of Lethbridge, Canada. His research interests include parallel systems, reconfigurable computing, fault tolerance, VLSI design, and information and network security. He is a member of IEEE.

Nuannuan Zhang was a graduate student in the Department of Computer Science, University of Regina, from September 1998 to September 2000.

Jiesheng Xie is a professor in the Department of Computer, China Agriculture University, Beijing. He was a visiting professor in the Department of Computer Science, University of Regina, from September 1999 to August 2000.


EURASIP Journal on Applied Signal Processing 2002:9, 981-989
© 2002 Hindawi Publishing Corporation

Wavelet Kernels on a DSP: A Comparison Between Lifting and Filter Banks for Image Coding

Stefano Gnavi
CERCOM, Center for Multimedia Radio Communications, Dipartimento di Elettronica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy
Email: [email protected]

Barbara Penna
CERCOM, Center for Multimedia Radio Communications, Dipartimento di Elettronica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy
Email: [email protected]

Marco Grangetto
CERCOM, Center for Multimedia Radio Communications, Dipartimento di Elettronica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy
Email: [email protected]

Enrico Magli
CERCOM, Center for Multimedia Radio Communications, Dipartimento di Elettronica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy
Email: [email protected]

Gabriella Olmo
CERCOM, Center for Multimedia Radio Communications, Dipartimento di Elettronica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy
Email: [email protected]

Received 30 August 2001 and in revised form 30 April 2002

We develop wavelet engines on a digital signal processor (DSP) platform, the target application being image and intraframe video compression by means of the forthcoming JPEG2000 and Motion-JPEG2000 standards. We describe two implementations, based on the lifting scheme and the filter bank scheme, respectively, and we present experimental results on code profiling. In particular, we address the following problems: (1) evaluating the execution speed of a wavelet engine on a modern DSP; (2) comparing the actual execution speed of the lifting scheme and the filter bank scheme with the theoretical results; (3) using the on-board direct memory access (DMA) to possibly optimize the execution speed. The results allow us to assess the performance of a modern DSP in the image coding task, as well as to compare the lifting and filter bank performance in a realistic application scenario. Finally, guidelines for optimizing code efficiency are provided by investigating the possible use of the on-board DMA.

Keywords and phrases: wavelet, lifting scheme, filter bank, JPEG2000, DSP.

1. INTRODUCTION

A huge number of applications use the discrete wavelet transform (DWT) [1] as a means to extract relevant features from signals. Examples are reported in the fields of mathematics, physics, numerical computing, and engineering, including image classification, feature detection, image denoising, image registration, and image compression, just to mention a few. Especially in the engineering field, there has been considerable interest in using wavelet transforms for image and video coding applications [2, 3]. As a result, the ISO/ITU-T has selected the DWT as the transform coding kernel for the new image compression standard, namely JPEG2000 [4], which will be released during 2001. Consequently, fast and cost-effective implementations of DWT kernels, compliant with the JPEG2000 specifications, are called for in order to make its diffusion as widespread as possible.

While the wavelet transform of an image can be fairly easily computed by means of a general-purpose personal computer, there obviously exist contexts where more compact, lightweight, and less power-demanding computing devices are required. A recent trend [5] fosters the design of reconfigurable systems that make use of digital signal processors (DSPs) and field-programmable gate arrays (FPGAs). An example is given by the transmission of images from scientific space missions, where the images collected by the on-board sensors may undergo wavelet-based compression (e.g., Rosetta Osiris [6]), with a DSP-based system being used as the computational core. DSPs are also very often used to handle image and video processing tasks in consumer electronics [7]. The importance of wavelets on a DSP is witnessed by the number of implementations proposed in the literature (cf. [8, 9]). In this paper, we focus on the study of the DSP-based implementation of a wavelet kernel; the target application is image coding with JPEG2000, with its extension to intraframe video coding (Motion-JPEG2000).

Until recently, DWT implementations were based on the so-called filter bank scheme [1], which computes the DWT of a signal by iterating a sequence of highpass and lowpass filtering steps, followed by downsampling. In 1997 Sweldens proposed a new scheme, called the lifting scheme (LS), as an alternative way to compute the DWT [10]. The LS has immediately obtained noteworthy success, as it provides several advantages with respect to the filter bank scheme. The most interesting ones from the implementation standpoint are that

(i) the LS requires fewer operations than the filter bank scheme, with a saving of up to one half for very long filters;

(ii) the LS allows the computation of an integer wavelet transform (IWT), that is, a wavelet transform that maps integers to integers [11], thus enabling the design of embedded lossless and lossy image encoders [12, 13].

This paper is focused on the development of a wavelet kernel based on the LS, using a DSP as the computational core. The interest of this work is manifold. Firstly, from a pure implementation perspective, the performance evaluation of an optimized implementation of such a kernel on a modern DSP indicates the maximum sustainable processing rate. This can be used to estimate the number of images per second that can be processed by, for example, a compression engine such as JPEG2000, or the video frame rate that can be sustained by a Motion-JPEG2000 encoder/decoder in DSP-based applications, for example, videoconferencing by means of a PC card. Secondly, since the LS can be used to design a progressive lossy-to-lossless compression algorithm [13], it is important to evaluate the execution speed of the IWT with respect to the DWT. Thirdly, and distinctively novel in this paper, from a signal processing point of view, there is a strong interest in finding out to which degree the theoretically lower complexity of the LS translates into reduced execution time; in fact, it is likely that the DSP architecture affects the performance of lifting and filter bank wavelet cores in a different fashion. In this work all aspects are considered; that is, an optimized DSP implementation of the LS is presented, and its performance is then compared with that of the filter bank scheme; both the DWT and IWT are considered. The results allow us to assess the impact of the DSP architecture on the performance of both algorithms, thus providing useful guidelines for the architectural design and implementation of a wavelet-based processing system.

Figure 1: Block diagram of the filter bank scheme.

This paper is organized as follows. In Section 2, we briefly review the wavelet transform, focusing on the filter bank scheme and the LS in Sections 2.1 and 2.2, respectively. The DSP implementations of both algorithms are described in Section 3. In Section 4, a performance evaluation of these implementations is proposed; in particular, results related to the execution speed are reported in Section 4.1, and a comparison between the LS and the filter bank scheme is presented in Section 4.2. The possibility of improving performance by means of direct memory access (DMA) is discussed in Section 4.3. Finally, in Section 5, conclusions are drawn.

2. WAVELET TRANSFORM

As already stated, the two main algorithms used to compute the DWT are the filter bank scheme and the LS, which are briefly reviewed in Sections 2.1 and 2.2, respectively.

2.1. Filter bank scheme

The filter bank scheme (see [1]) is sketched in Figure 1, where the operations needed to compute the DWT of a one-dimensional signal are depicted. One level of decomposition involves the input sequence being highpass and lowpass filtered by the analysis filters H(z) and G(z); the two resulting sequences are then downsampled by a factor of two. More decomposition levels can be obtained by iterating this procedure on the lowpass branch, as shown in Figure 1. The two-dimensional extension is achieved by filtering and downsampling first along the rows and then along the columns. The inverse transform is achieved by performing a similar sequence of filtering and upsampling operations (see [1]).
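As a concrete illustration of one decomposition level, the following C sketch of ours filters and downsamples in a single pass, computing only the output samples that survive the decimation; the function name, the periodic boundary extension, and the filter alignment are simplifying assumptions.

    /* One analysis level: in[] of even length n, analysis filters h[]
       (lowpass) and g[] (highpass) with `taps` coefficients; writes n/2
       samples to each of lo[] and hi[]. */
    void fb_analysis(const float *in, int n, const float *h, const float *g,
                     int taps, float *lo, float *hi)
    {
        for (int i = 0; i < n / 2; i++) {
            float a = 0.0f, d = 0.0f;
            for (int t = 0; t < taps; t++) {
                int k = (2 * i + t) % n;      /* periodic extension */
                a += h[t] * in[k];
                d += g[t] * in[k];
            }
            lo[i] = a;                        /* approximation (LP branch) */
            hi[i] = d;                        /* detail (HP branch) */
        }
    }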

2.2. Lifting scheme

As is well known [1], a discrete-time filter can be represented by its polyphase matrix, which is built from the z-transforms of the even and odd samples of its impulse response. The LS stems from the observation [14] that the polyphase matrix can be factorized, leading to the implementation of one step of the filter bank scheme as a cascade of shorter filters, which act on the even and odd signal samples, followed by a normalization. In particular, the LS performs a sequence of primal and dual lifting steps, as described in the following and reported in the block diagram of Figure 2. The inverse transform is achieved by performing the same steps in reversed order [14].

Figure 2: Block diagram of the LS.

The polyphase representation of a discrete-time filter H(z) is defined as

H(z) = H_e(z^2) + z^{-1} H_o(z^2),  (1)

where H_e(z) and H_o(z) are obtained from the even and odd coefficients of h[n] = Z^{-1}\{H(z)\}, respectively, where Z denotes the z-transform. The synthesis filters H(z) and G(z) (lowpass and highpass, respectively) can thus be expressed in terms of their polyphase matrix

P(z) = \begin{bmatrix} H_e(z) & G_e(z) \\ H_o(z) & G_o(z) \end{bmatrix}  (2)

and \tilde{P}(z) can be defined analogously for the analysis filters. The Euclidean algorithm [14] can be used to decompose P(z) and \tilde{P}(z) as

P(z) =m∏i=1

[1 si(z)

0 1

][1 0

ti(z) 1

]K 0

01K

,

P(z) =m∏i=1

[1 0

−si(z−1

)1

][1 −ti

(z−1

)0 1

] 1K

0

0 K

.

(3)

This factorization leads to the sequence of primal and duallifting steps shown in Figure 2.

The filters H_e(z), H_o(z), G_e(z), and G_o(z), along with their analysis counterparts, are Laurent polynomials [14]. The set of all Laurent polynomials exhibits a commutative ring structure, within which polynomial division with remainder is possible; however, such long division between two Laurent polynomials is not a unique operation [14]. Therefore, several different factorizations (i.e., pairs of {s_i(z)} and {t_i(z)} filters) may exist for each wavelet. However, in the case of DWT implementation, all possible choices are equivalent.

An IWT, mapping integers onto integers, can be achieved very simply by rounding off the output of the s_i(z) and t_i(z) filters right before adding or subtracting [11]; the rounding operation introduces a nonlinearity in each filter operation. As a consequence, in the IWT the choice of the factorization impacts both lossless and lossy compression, thus making the transition from the DWT to the IWT not straightforward [15].
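As an illustration of this rounding, the following sketch implements one level of the reversible LEG(5,3) forward transform, following the well-known JPEG2000 Part I formulation; the function name is ours, periodic indexing replaces the symmetric extension for brevity, and an arithmetic (flooring) right shift is assumed.

    /* Reversible LEG(5,3) one-level forward IWT (sketch).
       x: input of even length n; d: n/2 highpass (detail) samples;
       s: n/2 lowpass (approximation) samples, all integers. */
    void iwt53_forward(const int *x, int n, int *s, int *d)
    {
        int half = n / 2;
        /* dual lifting (prediction), rounded by the right shift */
        for (int i = 0; i < half; i++)
            d[i] = x[2 * i + 1] - ((x[2 * i] + x[(2 * i + 2) % n]) >> 1);
        /* primal lifting (update), again with a rounded output */
        for (int i = 0; i < half; i++)
            s[i] = x[2 * i] + ((d[(i + half - 1) % half] + d[i] + 2) >> 2);
    }

The inverse transform simply undoes the two steps in reversed order (subtracting the update first, then adding the prediction back), which is exact because the same rounded quantities are recomputed.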

Table 1: Computational cost (number of multiplications plus additions) of lifting versus filter banks.

    Filter        Standard algorithm    Lifting scheme
    LEG(5,3)      4(N + M) + 2          2(N + M + 2)
    DB(9,7)       4(N + M) + 2          2(N + M + 2)
    SWE(13,7)     3(N + Ñ) − 2          (3/2)(N + Ñ)

As already stated, the LS requires fewer operations than the filter bank scheme. The latter algorithm corresponds to merely applying the polyphase matrix: only the samples that are not discarded by the subsequent downsampling operation are actually filtered. In order to compare the two algorithms, one can count the number of multiplications and additions required to output a pair of samples, one on the lowpass and one on the highpass branch. As shown in [14], the cost of lifting tends, asymptotically for long filters, to one-half of the cost of the standard algorithm. Table 1 reports the formulas, presented in [14], to compute the cost of the two algorithms for the filters used in this work (see also Section 3). Here |h| and |g| are the degrees of the highpass and lowpass filters (i.e., the number of coefficients minus one); in the case that |h| and |g| are even, we set |h| = 2N and |g| = 2M. Note that the filter SWE(13,7), being an interpolating filter, has a different formula, which also involves the number of vanishing moments Ñ.
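As a worked example, for DB(9,7) the two filter degrees are 8 and 6, so that N + M = 7: the standard algorithm costs 4 · 7 + 2 = 30 operations per output pair, against 2 · (7 + 2) = 18 for the LS, a theoretical ratio of 30/18 ≈ 1.67. For LEG(5,3), N + M = 3 gives 14 versus 10, that is, a ratio of 1.4. These values match the theoretical figures reported later in Table 8.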

3. IMPLEMENTATION

The LS and the filter bank scheme have been implemented on the floating-point Texas Instruments TMS320C6711 DSP board. The board comprises a 150 MHz floating-point processor, two memory regions (on-chip and off-chip), a direct memory access (DMA) controller, and some peripheral interfaces. The CPU core includes two sets of 16 registers (register file A, register file B) and the arithmetic and logical units; the on-chip memory is divided into two cache memories (L1, L2) (see [16]). Figure 3 shows the block diagram of the DSP architecture.

In the following, we outline some features of our implementation of the two algorithms, including the filters and the types of boundary extension used. The LS implementation is compliant with the specifications of the Final Committee Draft of JPEG2000 Part I (core coding system) [4], which is, at the time of this writing, the latest publicly available document describing the standard.

Figure 3: Block diagram of the DSP architecture.

The code profiling results for the two algorithms, reported in Section 4, have been obtained by entrusting the optimization of the assembly code to the C compiler, which is known to achieve nearly the same efficiency as an expert programmer. For this reason, the code has been written in a simple and plain style, so as to facilitate compiler optimization. Therefore, in this section we only give an overview of the implementations of the two algorithms, and rather concentrate on the profiling results (Section 4), which represent the main contribution of this article. Of course, one could achieve some performance improvement by constraining the implementations, for example, to support a limited number of filters (even only one); nevertheless, this approach would negatively impact the generality of application, which in this work has been preserved as far as possible.

As for the boundary extension at the borders of the input signal, which is necessary because the wavelet filters are noncausal, two possible extensions are considered.

(i) Symmetric extension: mirroring of the signal samples outside the signal support. If used with biorthogonal symmetric filters, it allows perfect reconstruction also at the image borders (see the sketch after this list).

(ii) Zero padding: zeros are added before and after the signal. This extension is not supported by the JPEG2000 standard, but it is very simple and hence sometimes used.
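A minimal sketch of the symmetric extension, written as an index-mapping helper (the name and the whole-sample mirroring convention are our assumptions):

    /* Map an arbitrary index onto [0, n-1] by mirroring about the
       first and last samples (whole-sample symmetric extension). */
    static int mirror(int i, int n)
    {
        while (i < 0 || i >= n) {
            if (i < 0)
                i = -i;                 /* reflect about sample 0 */
            if (i >= n)
                i = 2 * (n - 1) - i;    /* reflect about sample n-1 */
        }
        return i;
    }

With this helper, a filtering loop simply replaces every access x[j] by x[mirror(j, n)]; zero padding would instead return 0 for out-of-range indices.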

The filters supported by this implementation have been selected according to the JPEG2000 standard:

• LeGall I(5,3) (LEG(5,3) in the following);
• Daubechies (9,7) (DB(9,7) in the following);
• Sweldens (13,7) (SWE(13,7) in the following).

The first two filters are explicitly embodied in JPEG2000, for the reversible and nonreversible transforms, respectively.

Table 2: Factorization of LEG(5,3). Each lifting filter has the form s_i(z), t_i(z) = a_0 z^{d_M} + a_1 z^{d_M−1} + a_2 z^{d_M−2} + · · ·

             d_M    a_0       a_1       a_2    a_3    K
    s1(z)    0      0         0         0      0      1
    t1(z)    1      0.5       0.5       0      0
    s2(z)    0      −0.25     −0.25     0      0

Table 3: Factorization of DB(9,7).

             d_M    a_0       a_1       a_2    a_3    K
    s1(z)    0      0         0         0      0      1.2302
    t1(z)    1      −1.5861   −1.5861   0      0
    s2(z)    0      −0.0530   −0.0530   0      0
    t2(z)    1      0.8829    0.8829    0      0
    s3(z)    0      0.4436    0.4436    0      0

Table 4: Factorization of SWE(13,7).

             d_M    a_0        a_1       a_2       a_3        K
    s1(z)    0      0          0         0         0          1
    t1(z)    2      0.0625     −0.5625   −0.5625   0.0625
    s2(z)    1      −0.03125   0.28125   0.28125   −0.03125

Note that the last filter is not supported by JPEG2000 Part I. However, it has been considered because, being a long filter, it makes it possible to verify the asymptotic complexity of the LS.

The selection of the factorizations of the wavelet filters to be used in the LS has been made following the directives of the JPEG2000 standard, and is reported in Tables 2, 3, and 4. The filter length, deducible from the acronym, allows easy identification of the N and M parameters previously defined.
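To show how the table entries are used, the fragment below applies the first dual lifting step t1(z) of DB(9,7) from Table 3, whose two coefficients equal −1.5861; the naming and the boundary handling are our own simplifications.

    /* One dual lifting step of DB(9,7): each odd (detail) sample is
       updated with the two neighboring even samples,
       d[i] += a * (e[i] + e[i+1]), with a = -1.5861 from Table 3. */
    static void db97_dual_lift_t1(const float *e, float *d, int half)
    {
        const float a = -1.5861f;
        for (int i = 0; i < half - 1; i++)
            d[i] += a * (e[i] + e[i + 1]);
        /* the last sample would use the boundary extension */
    }

The remaining steps s2(z), t2(z), and s3(z) are applied in the same way with the coefficients of Table 3, followed by the final scaling by K and 1/K.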

As for the filter bank scheme, the input signal is filtered by the same kernels listed above, but using the expanded rather than the factorized representation. Notice that, while performing the convolution between the signal and the filter impulse response, the samples that would be discarded by downsampling are not computed at all. For completeness, Tables 5, 6, and 7 report the coefficients of the filters employed, up to the fourth decimal digit.

4. EXPERIMENTAL RESULTS

As stated in Section 1, the objective of this work is manifold; in particular, experimental tests have been carried out with the following goals.

(1) To evaluate the absolute running time of an LS-based wavelet kernel on a modern DSP; in particular, in view of the implementation of an embedded lossy-to-lossless image compression system, to understand to which degree embodying an IWT capability may penalize the execution speed. This matter is discussed in Section 4.1.

(2) To find out how close the actual performance gain of the LS with respect to the filter bank scheme, in terms of execution speed, comes to the theoretical value. This matter is discussed in Section 4.2.

(3) To study the possibility of exploiting the available on-board DMA in order to speed up code execution. This matter is discussed in Section 4.3.

Table 5: LEG(5,3) filter.

    i      h0        h1
    0      0.75      1
    ±1     0.25      −0.5
    ±2     −0.125    0

Table 6: DB(9,7) filter.

    i      h0        h1
    0      0.6029    1.1151
    ±1     0.2669    −0.5913
    ±2     −0.0782   −0.0575
    ±3     −0.0169   0.0913
    ±4     0.0267    0

Table 7: SWE(13,7) filter.

    i      h0        h1
    0      0.6797    1
    ±1     0.2813    −0.5625
    ±2     −0.1230   0
    ±3     −0.0313   0.0625
    ±4     0.0352    0
    ±5     0         0
    ±6     −0.0020   0

The results shown in the following, including the comparison between the LS and the filter bank scheme (see Sections 4.1 and 4.2), are reported in terms of the time needed to perform one level of transform on one image row; this has been done so as to facilitate the interpretation of the results. The results have been parameterized on the length of the input data vector, and execution times for dyadic lengths are reported. It has been found that the sum of such dyadic values yields a very accurate estimate of the multilevel transform. Of course, computing the wavelet transform of an image requires performing both rowwise and columnwise filtering. However, it has been found that the time needed to compute a columnwise filtering is the same as for rowwise filtering. Even though this behavior might seem surprising at first glance, it can be reasonably justified by the efficient management of memory accesses performed by the cache memory; a more detailed explanation is given in Section 4.3.
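For example, thanks to this additivity, a three-level transform of a 1024-sample row can be estimated as t(1024) + t(512) + t(256), where t(·) denotes the one-level times reported in the following graphs.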

4.1. Absolute running times

The graphs in Figures 4, 5, and 6 report the absolute running times achieved by the LS (in the DWT and IWT modes) and by the filter bank scheme to compute the one-level wavelet transform of a one-dimensional data vector contiguously stored in the external memory. The boundary extension used is the symmetric one.

The results reported in the graphs can be used to estimate the number of images per second that can be processed by these algorithms. Employing the LEG(5,3) filter and the symmetric extension for a complete 2D one-level decomposition of a 256 × 256 grey-scale image, the LS processes between 7 and 8 images per second, whereas the filter bank scheme only sustains between 4 and 5 images per second. Note that computing the IWT, that is, rounding off the filtered coefficients in the LS, leads to slower operation: the running times of the IWT are from 10% to 25% larger than those of the DWT using the LS.

If the wavelet kernel is thought of as the core of a JPEG2000 encoder, it is worth recalling that the wavelet transform is responsible for a significant part of the total encoder and decoder running time. Some figures have been obtained by profiling the JasPer reference JPEG2000 implementation, and have been reported in [17]. It turns out that, for progressive lossless coding, the wavelet transform is responsible for about 30% of the overall encoder and decoder running time. In the progressive lossy case, this percentage increases to about 50% at the encoder and 70% at the decoder. This implies that it should be possible, with a single DSP, to encode or decode about two 256 × 256 images per second in the integer lossless mode (using the LEG(5,3) filter), and about 2 images per second in the lossy mode using the DB(9,7) filter. While these figures are suitable for an image coding application, more powerful hardware, such as a multi-DSP system or an FPGA, is required to sustain real-time Motion-JPEG2000 video.

4.2. Comparison between lifting and filter bank

As stated, in [14] it is claimed that the LS asymptotically requires half the number of operations with respect to the filter bank scheme. We have compared the running times of our LS and filter bank implementations in order to understand how the DSP architecture impacts the performance gain. In particular, Table 8 reports the ratios between the running time of the filter bank scheme and that of the LS. It can be noticed that the measured ratios differ from the theoretical values.

Figure 4: LEG(5,3): absolute running times.

Figure 5: DB(9,7): absolute running times.

Figure 6: SWE(13,7): absolute running times.

Table 8: Ratios between the running times of the filter bank scheme and the LS (filter bank running time / LS running time).

    Samples              LEG(5,3)    DB(9,7)    SWE(13,7)
    256                  1.650       1.069      1.969
    512                  1.678       1.077      1.937
    1024                 1.657       1.058      1.929
    2048                 1.679       1.062      1.917
    4096                 1.607       1.080      1.982
    Theoretical value    1.4         1.666      1.833

This behavior can be explained by considering the architectural features of the processor employed. The DSP used in this work has an efficient pipeline, which can dispatch 8 parallel instructions per cycle. Parallel instructions proceed simultaneously through each pipeline phase, whereas serial instructions proceed through the pipeline with a fixed relative phase difference between instructions. Every time a jump to an instruction not belonging to the pipeline occurs, the pipeline must be emptied and reloaded. Thus, the filtering operations that frequently update the pipeline contents turn out to be disadvantaged. The effect on the computation of the wavelet transform is that, in general, the convolution with one long kernel can be optimized more efficiently than several convolutions with short kernels. There is therefore a trade-off: the filter bank must perform twice as many operations as the LS with long filters; on the other hand, the pipeline hampers the LS operation, since the factorizations of long filters may consist of numerous short filters. The best results, with regard to the gain, are obtained with the SWE(13,7) filter: even though the filter is long, its factorization consists of only 2 filters, with 4 coefficients each.¹ The opposite occurs with the DB(9,7) filter, whose factorization consists of 4 filters with 2 coefficients each: the gain that comes from the smaller number of operations in the LS is thus lost in emptying and reloading the pipeline. The LEG(5,3) filter has an intermediate behavior. Note that, for an increasing number of samples, the ratio between the running times of the two algorithms is neither constant nor linearly increasing. This behavior is due to the way the processor manages the cache memory and the data transfers from external to internal memory.

4.3. Optimization with DMA

The results in Section 4.2 have shown that the LS is faster than the filter bank scheme for the computation of the wavelet transform. These results have been obtained using implementations in which the CPU itself carries out the data transfers from the external memory in order to perform the convolutions. In the following, we focus on the architectural features of the DSP employed, investigating the possibility of improving the LS performance by exploiting the properties of the DMA controller, typically available on a DSP board.

The LS program previously described filters a vector of coefficients allocated in a region of external (off-chip) memory. On the other hand, the DSP has a two-level internal (on-chip) cache, with a significantly lower access time than the external memory. The second-level cache (L2) can be configured as internal memory, and can be used to store and filter the image pixel values, with an expected speedup due to the reduced memory access time.

Since the size of an image is usually larger than the L2 size, it is necessary to transfer the data in small blocks from the external to the internal memory. The device used for this purpose is the DMA controller. In this work, the DMA has been configured so as to transfer a row (or column) of the image into the on-chip memory while, at the same time, the CPU filters the data transferred at the previous step. In this way, the CPU never accesses the off-chip memory, since both the stack and the temporary variables are allocated in L2.

¹ It is worth noticing that even higher gains (nearly 3) have been found with the SWE(13,7) filter, using a fixed-point implementation that is not addressed in this paper. This is not surprising, since the upper bound of 2 on the LS gain [14] is computed for the worst-case factorization, while the SWE(13,7) filter also admits shorter factorizations.

Figure 7: Ping-pong buffering.

The use of L2 as a shared resource between the DMA and the CPU requires synchronizing these devices: the data can be corrupted if the accesses to L2 do not take place in the correct order. To avoid this problem, we have employed four software interrupts to regulate the sequence of operations. Moreover, the two concurrent devices are set to operate on two different buffers, which are swapped at each filtering cycle with a ping-pong buffering mechanism (see Figure 7).
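The fragment below sketches the resulting processing loop; the DMA calls are hypothetical placeholders for the board API, and the interrupt-based synchronization is abstracted into a single wait.

    /* Hypothetical DMA/board API and row filter (placeholders). */
    extern void dma_load(float *dst, const float *src, int n);
    extern void dma_store(float *dst, const float *src, int n);
    extern void dma_wait(void);
    extern void lifting_row(float *row, int n);

    /* Ping-pong buffering: the DMA fills one on-chip buffer while
       the CPU filters the other; the buffers swap at every row. */
    void lift_image_rows(float *img, int rows, int cols,
                         float *buf0, float *buf1)
    {
        dma_load(buf0, img, cols);                /* prefetch row 0 into L2 */
        dma_wait();
        for (int r = 0; r < rows; r++) {
            float *work = (r & 1) ? buf1 : buf0;  /* row to filter */
            float *next = (r & 1) ? buf0 : buf1;  /* row to fetch */
            if (r + 1 < rows)
                dma_load(next, img + (r + 1) * cols, cols);
            lifting_row(work, cols);              /* CPU works in L2 */
            dma_wait();                           /* sync before swapping */
            dma_store(img + r * cols, work, cols);
        }
    }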

We have run this version of the LS on a 512 × 512 grey-scale image, performing one complete 2D decomposition level with the DB(9,7) filter. This has led to the results shown in Table 9, where the running time of the standard LS algorithm is also reported for comparison. It can be noticed that the synchronization of the devices and the reconfiguration of the DMA after each transfer lead to a higher running time. In order to make the employment of the DMA and L2 advantageous, it is necessary to reduce the number of DMA reconfigurations. This can be done by transferring more than one row or column at a time. Table 10 shows that transferring, for example, 2 or 4 rows simultaneously yields an improvement of the LS performance. However, the gain is not as high as expected, and hardly pays back the additional complexity. The reason for such a low gain lies in the efficiency of the DSP cache memory.

Table 9: Comparison between the running times of the LS without and with the DMA.

    Absolute running times [seconds]
    Standard LS     LS with DMA
    1.222           1.651

Table 10: Comparison between the running times of the LS without and with the DMA: transfer of several rows simultaneously.

    Absolute running times [seconds]
              Standard LS     LS with DMA
    2 rows    1.222           1.174
    4 rows    1.222           1.166

In fact, every time the CPU needs a datum stored in the external memory, 32 consecutive bytes are transferred from the memory to L1. If the offset between two data processed sequentially by the CPU is smaller than 32 bytes (i.e., 8 floating-point coefficients), the CPU accesses the external memory only once, since the second datum will already be cached in L1. The advantage of accessing a faster memory is therefore apparent only when the overall weight of the memory accesses is high. We assume that the image pixel values are stored in the external memory as floating-point values in row-major order. As for rowwise filtering, one access to the external memory is sufficient to retrieve eight samples of the to-be-filtered data. As far as columnwise filtering is concerned, once a complete image column has been retrieved from the external memory, the subsequent seven columns are also placed in the cache.²

² This holds provided that the cache memory is large enough to store eight columns, as usually happens in practice.

Moreover, in the specific case of the wavelet transform, most of the time is spent by the processor in computing the convolution, that is, sums and products between the filter coefficients and the image pixel values; the filtering routine is computationally heavy, so that the weight of the access operations is not very high. Therefore, the actual number of accesses to the external memory turns out to be quite limited, and their weight on the program running time accordingly low. In summary, the performance improvement in the wavelet transform computation that can be obtained by employing the DMA is limited because of the efficiency of the on-chip cache.
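As a quick sanity check of these figures: with 4-byte floats, one 32-byte L1 line holds 8 samples, so a rowwise filter triggers roughly one external access every 8 samples; columnwise filtering attains the same ratio because each line fetched for one column also caches the corresponding samples of the next seven columns.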

5. CONCLUSIONS

In this paper, we have addressed the development of wavelet cores on a DSP, compatible with the JPEG2000 specifications. The wavelet transform has been implemented according to both the filter bank scheme and the lifting scheme; in the latter case, the integer-transform option has also been considered. The code has been profiled so as to evaluate the efficiency of the implementation and, more interestingly, to allow a comparison between the LS and the filter bank scheme. Moreover, the use of the DMA has been considered as a possible way to improve the data throughput.

The results have highlighted some aspects of DSP-based implementations of the wavelet transform, which are discussed in the following.

(1) The DSP considered in this work is able to compute up to 8 complete 2D one-level wavelet transforms per second on a 256 × 256 grey-scale image. This figure can be used to evaluate the number of JPEG2000 frames that a single DSP is able to code or decode, for example, using the JPEG2000 profiling results reported in [17].

(2) A performance comparison between lifting and filter banks has been carried out. We have found that the LS always runs faster than the filter bank scheme. However, the performance gain differs from the theoretical results in [14], because the DSP architecture has a different impact on code optimization for the two algorithms. In particular, convolutions with long filters, which are typical of the filter bank scheme, tend to benefit from the DSP pipelined architecture. On the other hand, the LS gain is higher for long filters. In the end, the actual gain heavily depends on the number and length of the factorized filters used in the LS.

(3) It has turned out that employing the DMA to transfer data from the external to the internal memory (and vice versa), while the CPU concurrently filters the previously transferred data, may provide very little advantage, if any at all, in terms of execution speed. This is due to the fact that the on-chip cache memory is able to manage the data transfer operations very efficiently, for both rowwise and columnwise filtering.

ACKNOWLEDGMENT

This work was partially developed under the Texas Instruments Elite program.

REFERENCES

[1] M. Vetterli and J. Kovacevic, Wavelets and Subband Coding, Prentice-Hall, Englewood Cliffs, NJ, USA, 1995.

[2] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, “Image coding using wavelet transform,” IEEE Trans. Image Processing, vol. 1, no. 2, pp. 205–220, 1992.

[3] D. Lazar and A. Averbuch, “Wavelet-based video coder via bit allocation,” IEEE Trans. Circuits and Systems for Video Technology, vol. 11, no. 7, pp. 815–832, 2001.

[4] D. S. Taubman and M. W. Marcellin, JPEG2000: Image Compression Fundamentals, Standards, and Practice, Kluwer Academic Publishers, Dordrecht, The Netherlands, 2001.

[5] J. Eyre and J. Bier, “The evolution of DSP processors,” IEEE Signal Processing Magazine, vol. 17, no. 2, pp. 43–51, 2000.

[6] B. Fiethe, P. Ruffer, and F. Gliem, “Image processing for Rosetta OSIRIS,” in Proc. 6th International Workshop on Digital Signal Processing Techniques for Space Applications, vol. 144, ESTEC, Noordwijk, The Netherlands, September 1998.

[7] J. Eyre, “The digital signal processor derby,” IEEE Spectrum, vol. 38, no. 6, pp. 62–68, 2001.

[8] K. Haapala, P. Kolinummi, T. Hamalainen, and J. Saarinen, “Parallel DSP implementation of wavelet transform in image compression,” in Proc. IEEE International Symposium on Circuits and Systems, pp. 89–92, Geneva, Switzerland, May 2000.

[9] Y. Bao, H.-J. Wang, C.-C. J. Kuo, and R. Chung, “Design of a memory-scalable wavelet-based image codec,” in Proc. IEEE International Conference on Image Processing, Chicago, Ill, USA, October 1998.

[10] W. Sweldens, “The lifting scheme: a construction of second generation wavelets,” SIAM J. Math. Anal., vol. 29, no. 2, pp. 511–546, 1997.

[11] A. R. Calderbank, I. Daubechies, W. Sweldens, and B.-L. Yeo, “Wavelet transforms that map integers to integers,” Applied and Computational Harmonic Analysis, vol. 5, no. 3, pp. 332–369, 1998.

[12] A. Bilgin, P. Sementilli, F. Sheng, and M. Marcellin, “Scalable image coding using reversible integer wavelet transforms,” IEEE Trans. Image Processing, vol. 9, no. 11, pp. 1972–1977, 2000.

[13] M. Grangetto, E. Magli, and G. Olmo, “Efficient common-core lossless and lossy image coder based on integer wavelets,” Signal Processing, vol. 81, no. 2, pp. 403–408, 2001.

[14] I. Daubechies and W. Sweldens, “Factoring wavelet transforms into lifting steps,” J. Fourier Anal. Appl., vol. 4, no. 3, pp. 247–269, 1998.

[15] M. Grangetto, E. Magli, and G. Olmo, “Minimally nonlinear integer wavelets for image coding,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Istanbul, Turkey, June 2000.

[16] Texas Instruments, “TMS320C6000 CPU and instruction set reference guide,” Document SPRU189F, October 2000, www.ti.com.

[17] M. D. Adams and F. Kossentini, “JasPer: a software-based JPEG-2000 codec implementation,” in Proc. IEEE International Conference on Image Processing, vol. 2, pp. 53–56, Vancouver, BC, Canada, October 2000.

Stefano Gnavi was born in Biella, Italy, in March 1976. He received the degree in electrical engineering from Politecnico di Torino, Italy, in July 2001. Since March 2002 he has been a researcher under grant with the Center for Wireless Multimedia Communications (CERCOM) at the Department of Electronics, Politecnico di Torino. His research interests are in the field of image communication, video processing and compression, as well as hardware implementation. He is currently working on very low bit rate video coding techniques.

Barbara Penna was born in Castellamonte, Italy, in May 1976. She received the degree in electrical engineering from Politecnico di Torino, Italy, in July 2001. Since September 2001 she has been a researcher under grant with the Signal Analysis and Simulation (SAS) group at the Department of Electronics, Politecnico di Torino. Her research interests are in the field of data compression. She is currently working on novel SAR raw data compression algorithms based on wavelet transforms.

Marco Grangetto received the summa cum laude degree in electrical engineering from Politecnico di Torino in 1999, where he is currently pursuing a Ph.D. degree. His research interests are in the field of digital signal processing and multimedia communications. In particular, he is working on the development of efficient and low-complexity lossy and lossless image encoders based on wavelet transforms. Moreover, he is addressing the design of reliable multimedia delivery systems for tetherless lossy packet networking. He was awarded the Premio Optime by Unione Industriale di Torino in September 2000, and a Fulbright grant in 2001 for a research period at the Center for Wireless Communications (CWC) at UCSD.


Enrico Magli received the degree in electronics engineering in 1997, and the Ph.D. degree in electrical and communications engineering in 2001, from Politecnico di Torino, Turin, Italy. He is currently a postdoctoral researcher at the same university. His research interests are in the field of robust wireless communications, compression of remote sensing images, superresolution imaging, and pattern detection and recognition. In particular, he is involved in the study of compression and detection algorithms for aerial and satellite images, and of signal processing techniques for environmental surveillance from unmanned aerial vehicles. From March to August 2000 he was a visiting researcher at the Signal Processing Laboratory of the Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland.

Gabriella Olmo received the Laurea degree (cum laude) and the Ph.D. in electronic engineering from Politecnico di Torino in 1986 and 1992, respectively. From 1986 to 1988 she was a researcher with CSELT (Centro Studi e Laboratori in Telecomunicazioni), Turin, working on network management, nonhierarchical models, and dynamic routing. Since 1991, she has been an Assistant Professor at Politecnico di Torino, where she is a member of the Telecommunications group and the Image Processing Lab. Her main recent interests are in the field of wavelets, remote sensing, image and video coding, resilient multimedia transmission, joint source-channel coding, and stratospheric platforms. She has joined several national and international research programs under contracts with Inmarsat, ESA (European Space Agency), ASI (Italian Space Agency), and the European Community. She has coauthored more than 80 papers in international scientific journals and conference proceedings.


EURASIP Journal on Applied Signal Processing 2002:9, 990–1002
© 2002 Hindawi Publishing Corporation

AVSynDEx: A Rapid Prototyping Process Dedicated to the Implementation of Digital Image Processing Applications on Multi-DSP and FPGA Architectures

Virginie Fresse
CNRS UMR IETR (Institut en Electronique et Telecommunications de Rennes), INSA Rennes, 20 avenue des Buttes de Coesmes, CS 14315, 35043 Rennes Cedex, France
Email: [email protected]

Olivier Deforges
CNRS UMR IETR (Institut en Electronique et Telecommunications de Rennes), INSA Rennes, 20 avenue des Buttes de Coesmes, CS 14315, 35043 Rennes Cedex, France
Email: [email protected]

Jean-Francois Nezan
CNRS UMR IETR (Institut en Electronique et Telecommunications de Rennes), INSA Rennes, 20 avenue des Buttes de Coesmes, CS 14315, 35043 Rennes Cedex, France
Email: [email protected]

Received 31 August 2001 and in revised form 12 May 2002

We present AVSynDEx (concatenation of AVS + SynDEx), a rapid prototyping process aiming at the implementation of digital signal processing applications on mixed architectures (multi-DSP + FPGA). This process is based on the use of widely available and efficient CAD tools established along the design process, so that most of the implementation tasks become automatic. These tools and architectures are judiciously selected and integrated during the implementation process to help a signal processing specialist without relevant hardware experience. We have automated the translation between the different levels of the process in order to accelerate and secure it. One main advantage is that only a signal processing designer is needed: all the other specialized manual tasks are transparent in this prototyping methodology, thereby reducing the implementation time.

Keywords and phrases: rapid prototyping process, multi-DSP-FPGA architecture, CAD environment, image processing applications.

1. INTRODUCTION

The prolific evolution of telecommunication, wireless, and multimedia technologies has sustained the requirement for the development of increasingly complex integrated systems. Indeed, digital signal processing applications, including image processing, have become more and more complex, thereby demanding much greater computational performance. This aspect is especially crucial for certain real-time applications: to validate a new technique, functionality alone is not sufficient, as the algorithm has to be executed within a limited time. The first approach to meet this requirement was to optimize the algorithm, a task a digital signal or image designer could do. Nevertheless, this solution quickly became inadequate and, in parallel with the algorithm development, the implementation aspect must be taken into account. The use of parallel architectures distributes the application and thus reduces its execution time.

Currently, one of the best solutions is mixed platforms integrating a combination of standard programmable processors and a hardware part containing components like FPGAs, ASICs, or ASIPs. It has been demonstrated in [1, 2, 3] that such architectures are well suited for complex digital image processing applications: the distribution between the two parts is generally done by implementing the elementary and regular operations in the hardware part, the other processing steps being processed by the software part. These platforms can deliver higher performance, but this heterogeneous aspect requires both software and hardware engineering skills.

The result is the rapid execution of an application on such architectures, but the implementation process becomes long and quite complex: several different specialized engineers are needed for each part of the platform, and this separate parallel implementation poses the risk that the software and hardware designs diverge at the end of the process and lose their initial correlation. Moreover, it is difficult to manage all the tasks, especially the shared resources and the associated synchronizations [4, 5].

The negative side of such implementations on complex and parallel architectures is the long development time resulting from the number of specialized people involved in the process. The signal processing designer no longer has sufficient skills to supervise the complete development, and error detection at each level becomes more difficult. The intervention of several engineers requires a task partitioning between the different parts at the beginning of the process, and modifying this partitioning later is very difficult, as another complete implementation is often required.

Most computer-aided design (CAD) environments dedicated to the implementation on parallel and mixed architectures are called codesign tools [6, 7, 8], and they integrate a manual and arbitrary partitioning based on the designer's experience. Present codesign tools can address the problem of implementation either on several standard processors, or on one processor and a dedicated hardware part, but not on both of them, which is the topic of this paper.

An example of a codesign tool is POLIS (presented in [9]). This tool is dedicated to embedded systems supporting a control flow description. The representation is the CFSM (codesign finite state machine), whose advantage is its independence from the target architecture. There is also Chinook, which is dedicated to reactive real-time systems (as explained in [10]).

The objective of this work is to propose a full rapid prototyping process (AVSynDEx) by means of existing academic and commercial CAD tools and platforms. A translator between the CAD environments makes it possible to go automatically through the process.

The prototyping methodology enables a digital signal or image processing designer to create the application with a usual development environment (Advanced Visual System) and then to supervise the implementation on a mixed architecture without any other required skills. AVSynDEx is open-ended and can realize the partitioning between the software and hardware targets at the highest level of the application description.

The approach consists of starting with a customary environment used by digital signal processing developers. Then, the integration of the distributed executive generator SynDEx (synchronized distributed executive) leads to an optimized implementation on a parallel and mixed platform. The target architecture combines a multi-DSP (digital signal processor) part with an FPGA (field-programmable gate array) platform. A main characteristic is the presence of checking points at each level of the implementation process, which accelerates the development: the designer can check and correct the design immediately, without waiting for the implementation. This gives the designer the possibility to easily and quickly modify the algorithm or to change its implementation.

This prototyping process leads to a low production cost: SynDEx is a free academic CAD tool, and the multi-DSP and FPGA board is cost-effective compared to the development time and cost (including raw material, specialized engineers, and specific equipment) of a new platform. Moreover, this prototyping process can integrate new versions of the CAD tools and take advantage of their new features.

The remainder of this paper is organized into five sections. Section 2 gives an overview of the prototyping process by briefly introducing the CAD environments as well as the target architecture. Section 3 details all these elements. The prototyping methodology is fully described in Section 4, by explaining the compatibility requirements between the levels of the process and introducing the automatic translator. By way of process illustration, the implementation of the LAR image compression algorithm is given in Section 5. Section 6 concludes the paper.

2. OVERVIEW OF THE PROTOTYPING PROCESS

The prototyping process (Figure 1) aims at a quasi-automatic implementation of digital signal or image applications on a parallel and mixed platform. The target architecture can be homogeneous (multiprocessor) or heterogeneous (multi-DSP + FPGA). A real-time, distributed, and optimized executive is generated according to the target platform.

The digital image designer creates the data flow graph by means of the graphical development tool AVS. This CAD software enables the user to achieve a functional validation of the application. Then, an automatic translator converts this information into a new data flow graph directly compatible with the second CAD tool, SynDEx. This latter tool schedules and distributes the data flow graph according to the parallel architecture, and generates an optimized and distributed executive. This executive is loaded onto the platform by using GODSP, a loader and debugger tool. These tools are quite simple, and the links between them are automatic, accelerating the prototyping process.

The target applications are complex digital image processing algorithms, whose functions possess different granularity levels. The partitioning generally consists of implementing the regular and elementary operations on the FPGA part and the higher-level operations on the multi-DSP part. The prototyping process has the advantage of taking this partitioning aspect into account and of ensuring an adjustable and quickly modifiable decision. The image processing designer is not restricted to one implementation, and partitioning modifications are quickly realized.

Three elements are necessary for this prototyping process: the CAD tools, the target architecture, and the links for an automatic process.

3. PRESENTATION OF THE INTEGRATED CAD TOOLS AND THE MIXED PLATFORM

Several computer-aided design environments are used and judiciously integrated in the prototyping process. Two main CAD tools are necessary: AVS for the functional description and validation, and SynDEx for the generation of a distributed and optimized executive. A third tool, a translator between AVS and SynDEx, realizes the automatic link.

Figure 1: The prototyping process. It consists of the graphical image development tool AVS; SynDEx, which is dedicated to the generation of a parallel and optimized executive; and the loader-debugger GODSP. The links between these CAD environments are automatic. The data flow graph is implemented on a multi-DSP + FPGA board.

The target architecture is a mixed platform with a multi-DSP part and an FPGA one.

3.1. AVS: advanced visual system

AVS (Advanced Visual System) is a high-level environment for the development and functional validation of graphical applications [11]. It provides powerful visualization methods, such as color, shape, and size for accurate information about data, as shown in Figure 2. The AVS environment (Figure 3) contains several module libraries located on top and a workspace dedicated to the application development. Algorithms are constructed by inserting existing modules or user modules into the workspace. A module is linked to its input and output images and their corresponding types. Each module calls a C, C++, or Fortran function and the associated library files. Upon modification of an existing function, the module is immediately updated, and the algorithm as well. All these modules are connected by input and output ports to constitute the global application in the form of a static data flow graph. In the following, we consider that the traded data are mainly images represented as one-dimensional arrays. AVS includes a subset of visualization modules for data visualization, image processing, and user-interface design.

A main advantage is the automatic visualization of intermediate and resulting images at the input and output of each module. This characteristic enables the image processing designer to check and validate the functionality of the application before the implementation step.

3.2. SynDEx

SynDEx is an academic system-level CAD tool [13, 14]. This free tool is designed and developed at INRIA Rocquencourt, France, and several national laboratories, including ours, take part in the project. SynDEx is an efficient environment that uses the AAA methodology to generate a distributed and optimized executive dedicated to parallel architectures.

AAA stands for algorithm architecture “adequation”; adequation is a French word meaning an efficient matching (note that it is different from the English word adequacy, which involves a sufficient matching) [15]. The purpose of this methodology is to find the best matching between an algorithm and a specific architecture while satisfying constraints. The methodology is based on graph models to exhibit both the potential parallelism of the algorithm and the available parallelism of the hardware architecture; this is formalized in terms of graph transformations. Heuristics taking into account execution times, durations of computations, and intercomponent communications are used to optimize the real-time performance and resource allocation of embedded real-time applications. The result of the graph transformations is an optimized executive built from a library of architecture-dependent executive primitives composing the executive kernel. There is one executive kernel for each supported processor. These primitives support boot loading, memory allocation, intercomponent communications, sequentialization of the user-supplied computation functions and of the intercomponent communications, and intersequence synchronization.

Figure 2: Examples of AVS applications. Above, the Tracking Money Launderers and, below, the Bright Forecast at the National Weather Service [12]. These examples use color, size, and shape for data visualization.

Figure 3: The AVS environment. Above, the libraries containing the available modules. A rectangle is a defined module. The designer can create and insert modules into those libraries. The red and pink marks represent the inputs and outputs; the color indicates the type of the ports. Below, the workspace for the algorithm creation. The visualization of an image is done by inserting the Uviewer2D module and connecting it to the target module (here, the output module OUT).

SynDEx ensures the following tasks [16].

(i) Specification of an application algorithm as a conditioned data flow graph (or interface with the compiler of one of the synchronous languages ESTEREL, LUSTRE, SIGNAL through the common format DC). The algorithm is described as a software graph.

(ii) Specification of the multicomponent architecture as a hardware graph.

(iii) Heuristics for distributing and scheduling the algorithm on the architecture with response time optimization.

(iv) Visualization of the predicted real-time performances for the multicomponent sizing.

(v) Generation of deadlock-free executives for real-time execution on the multicomponent, with optional real-time performance measurement. These executives are built from a processor-dependent executive kernel [17]. SynDEx currently comes with executive kernels for digital signal processors and microcontrollers (SHARC-ADSP21060, TMS320C4x, Transputer-T80X, i80386, i8051, i80C96, MC68332) and for workstations (UNIX/C/TCP/IP on SUN, DEC, SGI, HP, and PC Linux). Executive kernels for other processors can easily be ported from the existing ones. The shared resources and the synchronizations are taken into account. SynDEx transfers images by using static memory whose allocation is optimized. The current development work on SynDEx is to refine the communication media and to extend the tool to mixed architectures (including FPGA and other ASICs), so that this evolution will remain coherent for future complex products.

The SynDEx environment is shown in Figure 4. The edition view contains two graphs: the hardware architecture above and the software graph below. The hardware graph represents the target architecture with the hardware components and the physical links, whereas the software graph is the data flow graph of the application: each vertex is a “task” (compiled sequence of instructions), and each edge is a data dependency between the output of an operation and the input of another task. A vertex is defined by means of information such as the input and output images, the size and type of these images, the name of the corresponding C function, and the execution time. If the execution time is not known, a first random value must be assigned to every function. Any random value can be used, but a default value is chosen to obtain an automatic translation and generate the executive by means of SynDEx. Then, SynDEx generates a first sequential executive on a monoprocessor implementation in order to determine the real task times.

The granularity level of the graph has an impact on the final implementation: many vertices lead to more parallelism but also increase the data communication cost.

SynDEx generates a timing diagram (Figure 5) according to the hardware and software graphs. The schedule view includes one column for each processor and one line for each communication medium. The timing diagram describes the distribution (spatial allocation) and the scheduling (temporal allocation) of the tasks on the processors, and of the interprocessor data transfers on the communication media. Time flows from top to bottom, and the height of each box is proportional to the execution duration of the corresponding operation.

Figure 4: SynDEx CAD software: the workspace. Above, a target architecture with 4 DSPs: C4, C3, C2, and root. Root is the DSP dedicated to the video (image grabbing and display): the input and output functions are processed by this processor. The physical connections are represented. Below, the software graph (algorithm), containing the processing tasks (vertices) and the data dependencies (edges). Ee and Se are respectively the input and output images.

Figure 5: The SynDEx timing diagram. Each column represents the task allocation for one DSP (Moy1, Ngr1, and Eros1 are treated by the DSP C2, and the video modules Ee and Se are implemented on the video DSP root). The size of each rectangle is the execution time of the corresponding task, and the lines between the columns indicate the communication transfers. This diagram shows the parallelism of the algorithm and the timing estimation.

SynDEx is an efficient tool for the implementation on parallel architectures, but it is not an application development environment, as it is not able to simulate the software graph. Without any front-end tool, the signal processing designer has to create a sequential C application by means of current C development tools. Once the functional validation is done, the application has to be split into several functions for the SynDEx data flow graph, each function representing a vertex of the graph. The resulting data flow graph can be checked only after the implementation on the target platform, and the manual transformation into the SynDEx data flow graph can generate some mistakes. Moreover, optimization is quite long and not automatic: the image processing designer has to work with the initial algorithm, the SynDEx tool, and the transformations between the two data flow graphs.

3.3. The multi-DSP-FPGA platform

The development of one's own parallel architecture is a very complex task, while suppliers offer powerful solutions: this is the reason why we have opted for a commercial product. The choice of the target architecture has been directed by different motivations. Firstly, the platform had to be generic enough to integrate most of the possible image applications. Secondly, it had to be modular, to be able to evolve with time. Thirdly, the architecture programming had to be at a sufficiently high level to be interfaced with SynDEx. Fourthly, the cost had to be reasonable, to represent a realistic solution for processing speedup.

The experimental target architecture is a multi-DSP and FPGA platform based on a Sundance PCI motherboard (Sundance Multiprocessor Technology Ltd., Chiltern House, Waterside, Chesham, UK), whose characteristics enable the user to obtain a coherent and flexible architecture.

Two Texas Instruments Modules (TIM) constitute the multi-DSP part [18, 19]. The first one integrates two TMS320C44 processors to carry out the processing. The second module is a frame grabber containing one TMS320C40 DSP. When it is not used for video processing, this DSP can process images as well.

An additional FPGA part (Figure 6), designed as a reconfigurable coprocessor for TMS320C4x-based systems, is associated with this multi-DSP platform. It is a MIROTECH X-C436 board [19] integrating an XC4036 FPGA, fully compatible with the TIM specifications: this module can be directly integrated onto the motherboard.

This FPGA is used as two virtual processing elements (VPEs). Each VPE is considered as one XC4013 FPGA, the rest of the full FPGA being used for the communication port management: all external transfers between the modules and the multi-DSP architecture are resynchronized by a communication port manager (CPM). The designers of the board propose this solution, which is used in the prototyping process, and they provide the existing cores (a core being a processing task for one VPE) for this configuration. Nevertheless, it can be extended to different VPE sizes and numbers, on the sole condition that the image processing designer or a hardware engineer creates the specific cores.

Each VPE is connected to a C4x processor via direct links and DMA. The target topology of the platform is shown in Figure 7. This host processor performs the FPGA module configuration and the data transfers. Thus, the use of the dedicated module is straightforward, as it consists only of function calls inside the DSP code. A configuration step is necessary before the processing step. The configuration step includes the initialization of the module (link specifications between the VPEs and the DSP, license affectation), the parameters for each core (data size, image size, . . .), the core assignments for each VPE, and the module configuration (implementation of all the previous configurations on the board).

Figure 6: The X-CIM architecture. The FPGA includes 2 VPEs dedicated to the image processing. ILINK is a direct link between both. The CPM is responsible for the communications between the DSP and the FPGA. A supervisor, the Configuration and Shutdown Unit, controls the clock and the reset.

Figure 7: The proposed multi-DSP FPGA topology. The multi-DSP part contains 3 DSPs called root, P1, and P2, root being the video processor. The black squares are the physical links between DSPs. The FPGA part is the coprocessor for the root DSP. Each VPE has one input and one output, which are connected to the root DSP. The FPGA part is fully transparent to the user, and its functions are managed by the root processor.

Afterwards, the processing can be achieved in three possible ways.

(i) Transparent communication mode. The first solution consists of using one instruction corresponding to the target processing. The input and output images are the only necessary parameters: this single instruction sends the input images to the FPGA and then receives the output images at the end of the execution. This unique instruction is easy to use but prevents the processor from running a parallel function.

(ii) Low-level communication mode. In a second approach, the user gives the input and output images by using pointers and sends the images pixel per pixel; the pixels are immediately processed and then sent back. With this method, a function is time-consuming, and the processor cannot run another function at the same time.

(iii) DMA communication mode. The last way consists of sending the input image via the DMA. Specific instructions enable the designer to associate the image, to read and write in the DMA, and to wait for the end of the processing. The advantage is that the processor can execute another process at the same time (a sketch of this mode is given below).
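To make the third mode concrete, the fragment below sketches its use on the host DSP; every function name is a hypothetical stand-in for the board library, whose actual API is not detailed in this paper.

    /* Hypothetical stand-ins for the board library calls. */
    extern void vpe_dma_associate(int vpe, float *in, float *out, int n);
    extern void vpe_dma_write(int vpe);   /* start the input transfer */
    extern void vpe_dma_read(int vpe);    /* retrieve the output image */
    extern void vpe_dma_wait(int vpe);    /* wait for the end of processing */
    extern void do_other_work(void);      /* any concurrent DSP task */

    /* DMA communication mode: the core runs while the CPU is free. */
    void process_with_vpe(float *in, float *out, int n)
    {
        vpe_dma_associate(0, in, out, n); /* bind the images to VPE 0 */
        vpe_dma_write(0);                 /* send the input image via DMA */
        do_other_work();                  /* CPU executes another process */
        vpe_dma_wait(0);                  /* synchronize with the core */
        vpe_dma_read(0);                  /* read back the result */
    }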

For all these processing possibilities, specific libraries contain the cores and the specific instructions for the FPGA configuration and use. The library declaration is inserted in these functions.

The configuration and processing tasks can be separated and included in different functions. The configuration time of the dedicated module is long (about 2.6 seconds), limiting the coprocessor to a static use with two types of operation.

4. PROTOTYPING METHODOLOGY FOR MIXED AND PARALLEL ARCHITECTURES

AVSynDEx is a prototyping process aiming to go automatically from the functional AVS description to the distributed execution on the multi-DSP or mixed architecture (see Figure 1). This implies first guaranteeing full compatibility between the elements through the process, and then generating the automatic links. The general requirements for a multi-DSP implementation are listed first, before the specific ones linked to the dedicated target.

4.1. Compatibility for multi-DSP architectures

4.1.1 SynDEx-multi-DSP platform

SynDEx can handle the multi-DSP architecture once the synchronization primitives, the memory management, and the communication schemes have been realized for the type of processors involved. Architecture configurations, such as the frame grabber initialization, are gathered in an INIT file executed once at the beginning of the application run.

4.1.2 AVS-SynDEx

SynDEx and AVS present similar semantics for the application description, both using a static data flow graph. Nevertheless, some particularities have to be dealt with.

Restrictions in SynDEx description

Only a few data types are defined in SynDEx (Boolean, integer, real, etc.), and the dimensions of the arrays must be fixed. The same rules have to be applied to the AVS graphs.
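For instance, an image buffer exchanged between vertices could be declared with fixed dimensions as follows (a minimal sketch; the 120 × 120 size matches the test image of Table 1, and the type name is ours):

    /* SynDEx-compatible declaration: array dimensions fixed at
       graph-description time. */
    #define IMG_W 120
    #define IMG_H 120
    typedef unsigned char image_t[IMG_H][IMG_W];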

C-functions associated with processing vertices

For both graphs, each vertex can be associated with a C-function. A skeleton of the function is generally created by AVS when a new module is edited; it contains specific AVS instructions that interface the user code with the global application. To be compiled in the SynDEx environment, all these instructions have to be removed.
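Schematically, a cleaned vertex function is a plain C function whose arguments are the vertex inputs and outputs. The example below is ours (a simple threshold); in the AVS skeleton, this body would be surrounded by framework-specific interfacing calls, which the translator strips out:

    /* Vertex function as compiled in the SynDEx environment: no AVS
       interfacing code remains, only the treatment itself. */
    #define NPIX (120 * 120)

    void threshold_vertex(const unsigned char in[NPIX], unsigned char out[NPIX])
    {
        for (int i = 0; i < NPIX; i++)
            out[i] = (in[i] < 128) ? 0 : 255;
    }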

Specific vertices

In and Out functions are of course platform dependent. In the AVS environment, IN and OUT correspond to reading and writing image files, whereas they are linked to video capture and display in SynDEx.

Figure 8: Presentation of the automatic translator. The translator takes as inputs the AVS data flow graph and the associated user c-files and h-files; it generates the SynDEx software graph, the SynDEx hardware graph, the DSP configuration file, and the corresponding c-files, h-files, and cores.

Besides the processing vertices, SynDEx also defines three specific ones:

(i) Memory: a storage element acting as a FIFO whose depth is variable.

(ii) When: allows a conditional graph to be built (the rest of the graph is executed only if an input condition is asserted).

(iii) Default: selects which of two inputs is transmitted, depending on an input condition.

The corresponding AVS primitives have been designed in V (the low-level language of AVS) to preserve the full graph-management potential of SynDEx.
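Their behaviour can be paraphrased in C as follows (our illustration of the semantics only, not SynDEx syntax):

    #include <stdbool.h>

    /* Memory: a FIFO whose depth is a parameter of the vertex. */
    #define DEPTH 64
    static int fifo[DEPTH];
    static int head = 0, tail = 0;

    void memory_push(int v) { fifo[tail] = v; tail = (tail + 1) % DEPTH; }
    int  memory_pop(void)   { int v = fifo[head]; head = (head + 1) % DEPTH; return v; }

    /* When: the downstream subgraph runs only if the condition is asserted. */
    void when(bool cond, void (*subgraph)(void))
    {
        if (cond)
            subgraph();
    }

    /* Default: forwards one of its two inputs according to the condition. */
    int default_select(bool cond, int a, int b)
    {
        return cond ? a : b;
    }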

4.2. Compatibility specific to FPGA module

4.2.1 SynDEx-Mixed platform

As the FPGA module management is carried out by a host DSP, the adopted solution consists of representing the use of a VPE in the data flow graph by a dedicated vertex linked to the host. The associated function contains only the reference to the core, following the transparent communication mode. The essential advantage is that the data flow graph remains unchanged compared to the multi-DSP case: whatever the target architecture (software or hardware), a task is specified by its inputs, outputs, and execution time.

The FPGA module configuration is also stored in the global INIT file.

4.2.2 AVS-SynDEx

In order to get an equivalent functional graph, a functionally equivalent C-function has to be developed for each available core, and gathered in a library. This is an easy task, as the low-level treatments generally correspond to simple algorithms. For a multi-DSP-only architecture, the function can be directly reused and implemented on a DSP. When the FPGA module is used, it is replaced by the call to the core.

4.3. The automatic translator

The fulfillment of the compatibility between the prototyping process stages makes it possible to go from the functional description to the parallel implementation. By designing a translator between AVS and SynDEx, the process is performed automatically.

The automatic translator is designed with the Lex and Yacc tools: the first one extracts the necessary parts of a sequence, whereas the second one transforms an input stream into another one.
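As a flavour of this approach, the minimal Lex specification below echoes a C source unchanged while dropping lines that contain a framework-specific marker. The AVSmod_ prefix is an invented stand-in; the real translator recognizes the actual AVS constructs:

    %{
    /* Minimal Lex sketch: strip framework-specific lines, keep the rest. */
    %}
    %option noyywrap
    %%
    ^.*AVSmod_.*\n   ;       /* drop lines calling AVS-specific instructions */
    .|\n             ECHO;   /* copy everything else to the output */
    %%
    int main(void) { yylex(); return 0; }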

The translator carries out the following tasks, as shown in Figure 8:

(i) Transforms the AVS data flow graph syntax into a SynDEx one.

(ii) Looks for the user c-files and h-files associated with each module and “cleans” them of specific AVS instructions.

(iii) Transforms the Memory, When, and Default primitives into SynDEx ones.

(iv) Generates the constraints (e.g., IN and OUT linked to the host DSP).

(v) Automatically adds the target architecture (hardware graph).

(vi) Generates the INIT file for the multi-DSP configuration.

Moreover, the translator is a key element in the codesign process. A flag associated with each core-equivalent AVS module indicates whether the target is a DSP or the hardware module. In the first case, the c-files and h-files are fetched and reused for the generation of the executive. In the second case, these files are replaced by the corresponding core, and the FPGA configuration is added to the INIT file. Thus, the allocation/partitioning tasks are easily done in the functional environment.

Another field of the AVS modules contains the execution time of the operation, if it is known (otherwise a random value is assigned). This timing information is not needed in the AVS description itself and is inserted in the module as a comment: AVS makes no difference between two otherwise identical C-functions whose timing information differs. The information serves two purposes. First, the image processing designer can use it to decide which partitioning (software or hardware module) is more efficient. Second, the automatic translator needs it to generate the SynDEx data flow graph, where each task carries its execution time. For the cores, this time has generally already been measured, and it is copied into the SynDEx description as well.


Figure 9: The AVSynDEx prototyping methodology for mixed and parallel architectures. The starting point is the specification, from which the data flow graph is created. At most two runs of the implementation process are necessary: the first one produces the timing report by means of a sequential executive (left part); it is needed only for new C-functions and is skipped otherwise. The second one is the implementation on the mixed platform. The links between the CAD tools are automatic, and the designer supervises all the implementation steps.

4.4. Prototyping process

The implementation process is simple, requiring at most the two development runs presented in Figure 9. For new user C-modules, the execution times have first to be determined in order to eventually obtain an optimized parallel implementation. This is done by first considering a mono-DSP target, where the user constrains all the tasks of the software graph to be assigned to the root processor. The executive generated by SynDEx at this step is purely sequential. The loader GODSP handles the implementation of the application and the report of the timing information. Then, the designer only has to copy these times into the AVS modules. If the application is made of already timed C-modules, this first run of the process is of course unnecessary.

Once the algorithm is functionally validated and the partitioning is decided by the designer, the automatic translator generates the new SynDEx description associating the C-functions and the cores. From this point on, the hardware graph is the multi-DSP architecture. SynDEx schedules and distributes the algorithm and produces the resulting timing diagram. The user can then choose either to modify the partitioning in AVS or to run the application on the mixed platform.

The main advantage of this prototyping process is its simplicity, as most of the tasks performed by the user concern the application description within a familiar environment. The required knowledge of SynDEx and the loader is limited to simple operations.

Other front-end development tools can be used in the process instead of AVS, as long as they present similar semantics for the application description. Ptolemy-related works can be found in [20, 21]. AVSynDEx can be adapted to other architectures as well.

5. IMPLEMENTATION OF AN IMAGE COMPRESSION ALGORITHM

A new image compression algorithm has been developed in our laboratory: its implementation on a mixed architecture provides a validation of our fast prototyping methodology. This algorithm, called LAR (locally adaptive resolution) [20], is an efficient technique well suited for image transmission over the Internet or for embedded systems. Basically, the LAR method was dedicated to gray-level still image compression, but extensions have also been proposed for colour images and videos [22].

5.1. Principle of the compression

The basic idea of the LAR method is that the local resolution (pixel size) can depend on the activity: when the luminance is locally uniform, the resolution can be low (large pixel size); when the activity is high, the resolution has to be finer (smaller pixel size).

The first coder is an original spatial technique and achieves high compression ratios. It can be used as a stand-alone technique, or complemented by a second coder that encodes the error image using the topology description from the first coder. This second coder is based on a DCT transform with optimal block sizes. This study concerns only the first, spatial coder. Figure 10a presents its global process.

Figure 10: (a) Global scheme of the spatial LAR coder: the nonuniform subsampling of the source image produces the grid and the gray-level blocks (block averages); the blocks are quantized and then passed through a differential entropic coding stage, whose entropic code forms the compressed image. (b) Decomposition of the nonuniform subsampling function: successive 3 × 3 erosions and dilations provide stationarity measures within 3 × 3, 5 × 5, and 17 × 17 blocks, each difference being compared with a threshold T.

The image is first subsampled into 16 × 16 squares representing local trees. Then, each one is split according to a quadtree scheme depending on the local activity (edge presence). The finest resolution is typically 2 × 2 squares. The image can be reconstructed by associating to each square the corresponding average luminance in the source image.

The image content information conveyed by the square size is exploited for the luminance quantization. Large squares require a fine quantization, as they are located in uniform areas (where the human eye is strongly sensitive to brightness variations). Small ones support a coarse quantization, as they lie on edges (low sensitivity). Size and luminance are both encoded by an adaptive arithmetic entropic encoder. The average cost is less than 4 bits per square.

5.2. Functional description of the application by means of AVS

In order to obtain the best implementation of the data flow graph on the mixed architecture, the image processing designer has to expose both the elementary operations available in the core library and the additional data parallelism allowed by some tasks. All the decisions and modifications are made only at the functional level (AVS data flow representation).

In the LAR method, the block stationarity property is evaluated by a morphological gradient followed by a threshold. A morphological gradient is defined as the difference between the dilated value (maximal value in a predefined neighbourhood) and the eroded value (minimal value in the same neighbourhood). A low resulting value indicates a flat region; a high one reveals an edge in the neighbourhood. By computing this stationarity estimation over growing neighbourhood surfaces (2 × 2, 4 × 4, 8 × 8, and 16 × 16), it is possible to choose the maximal block size that represents the region while keeping the stationarity property. The major drawback of this approach is that the complexity of the morphological operators is proportional to the neighbourhood size, so an erosion/dilation over a 16 × 16 block requires 256 operations per pixel. To reduce the complexity, one generally uses the Minkowski addition, performing an operation over a large neighbourhood as successive operations over smaller neighbourhoods [23]. As 3 × 3 erosion and dilation operators are available in the core library, the graph modifications at this stage have consisted of decomposing the global morphological operations into iterated elementary ones (see Figure 10b).
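For concreteness, here is a plain C sketch of these elementary operators (our own illustration, not the library cores; border pixels are simply copied). Applying the 3 × 3 operator k times extends the result to a (2k + 1) × (2k + 1) neighbourhood, which is the decomposition exploited in Figure 10b:

    #include <string.h>

    #define W 120
    #define H 120

    /* 3x3 erosion (sign = -1) or dilation (sign = +1): minimum or
       maximum over the 3x3 neighbourhood of each inner pixel. */
    static void morpho3x3(const unsigned char *in, unsigned char *out, int sign)
    {
        memcpy(out, in, W * H);            /* keep border pixels as-is */
        for (int y = 1; y < H - 1; y++)
            for (int x = 1; x < W - 1; x++) {
                unsigned char v = in[y * W + x];
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++) {
                        unsigned char p = in[(y + dy) * W + (x + dx)];
                        if ((sign > 0) ? (p > v) : (p < v))
                            v = p;
                    }
                out[y * W + x] = v;
            }
    }

    /* Morphological gradient = dilation - erosion: small values indicate
       a flat (stationary) region, large values reveal an edge. */
    static void gradient3x3(const unsigned char *in, unsigned char *grad)
    {
        static unsigned char dil[W * H], ero[W * H];
        morpho3x3(in, dil, +1);
        morpho3x3(in, ero, -1);
        for (int i = 0; i < W * H; i++)
            grad[i] = (unsigned char)(dil[i] - ero[i]);
    }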

Data parallelism has also been pointed out for multiprocessing purposes, as most of the other functions operate locally on 16 × 16 blocks.

The algorithm development and optimizations are achieved using the AVS tool. The designer can develop the application and easily refine the granularity of several functions. The data flow graph of the LAR application is developed, and the resulting AVS data flow graph is shown in Figure 11. AVS enables the image processing designer to check the functionality of the new algorithm, as shown in Figure 12.

5.3. Implementation on the multi-C4x-FPGA platform

Following the presented prototyping process, the automatic translator generates the corresponding SynDEx data flow graph and the associated files. A first monoprocessor implementation is required for the timing measurements, and the modules are specified as software modules. SynDEx generates a sequential executive for the implementation on the root processor. The designer collects the timing reports (Table 1) and inserts these new times in the AVS data flow.

The execution time for a single-C4x implementation is 3.21 seconds and serves as the reference for the parallel one.

The second architecture choice is to use only the three C4x DSPs, as represented in Figure 13. The modification consists only of removing, in SynDEx, the constraint that allocated all tasks to a single DSP.

Figure 11: Presentation of the LAR algorithm under the AVS environment. On top, the new modules are inserted into libraries (right). Below, the data flow graph of the LAR application. The output image is visualized by means of the Uviewer2D module: it is the Lena image.

The best distribution according to SynDEx uses only two C4x processors (root and P1). The timing diagram (Figure 14) shows that a three-DSP implementation would be slower, or at best no more efficient. The resulting time for this implementation is 1.74 seconds.

As the 3 × 3 erosion and dilation cores are available, the last solution consists in using the FPGA to perform these operations. A comparison with the software implementation shows that the hardware one is approximately 100 times faster. Note that the architecture limits the number of cores in an application, but these cores can be used several times (several identical vertices in the graph). Changing the target flag of the equivalent AVS modules and running the translator again leads to a new SynDEx input file. The Er3∗3 and Dil3∗3 C-functions are replaced by calls to the corresponding cores. The SynDEx data flow graph remains unchanged, except for new constraints allocating the FPGA tasks to the host (root) processor. Then, SynDEx can schedule and distribute the application according to the new software and hardware graphs. The timing diagram is almost the same as in Figure 14, except for the erosion and dilation tasks, which are much shorter. The resulting execution time is 245 milliseconds.

5.4. Implementation on a multi-C6x platform: AVSynDEx version 2

The new Sundance multiprocessor architecture is now available, and an upgrade of AVSynDEx (version 2) to this new architecture is in progress. The platform consists of a Sundance SMT320 motherboard with two SMT335 TIMs. Each module contains a TMS320C6201 processor (with a 200 MHz clock frequency) and one FPGA dedicated to the communication management between the two processors.

Table 1: Module execution times in microseconds (120 × 120 pixel image).

Function   DSP C4x         FPGA           DSP C6x
IN         50 648          —              —
Er3∗3      371 400 (∗4)    4 176 (∗4)     18 551 (∗4)
Dil3∗3     370 681 (∗4)    4 150 (∗4)     18 410 (∗4)
GrStep4    6 777           —              562
GrStep8    5 967           —              636
BlAver     7 558           —              11 124
DPCM       33 886          —              14 721
SiZcod     71 586          —              3 243
GrayCod    64 275          —              3 193
FilGray    2 816           —              232
FilSiz     2 859           —              243
OUT        12 141          —              —

Figure 12: Visualization of images in the AVS environment; image display is available via an Uviewer2D module. (a) Original image (512 × 512, 8 bits per pixel). (b) Nonuniform grid. (c) Reconstructed image. (d) Reconstructed image after postprocessing (0.18 bits per pixel, PSNR 28.5 dB).



Table 2: Comparison of the improvements brought by the prototyping process. The proposed process shortens the development time and is friendlier to use. The optimizations and the functional validation improve the application description and lead to a rapid implementation. The partitioning is also easier, as the SynDEx tool is not well suited to such modifications.

           Full development   Translation AVS-SynDEx   Functional validation   Partitioning   Error detection
Without    3–4 days           60 min                   No                      SynDEx         No
With       1 day              5 min                    Immediate               AVS            Immediate

Figure 13: The generated SynDEx description. The architecture is added (three DSPs: root, P1, and P2). The software graph is similar to the AVS data flow graph.

At present, SynDEx does not fully support the generation of an optimized distributed executive of the algorithm for this new architecture. Work is in progress in our laboratory on the generation of the executive integrating the conventional features, that is, the description of primitives for DSP synchronization and for data exchange via DMA and the communication buses. In parallel, new features such as shared memory and conditional nodes are being added.

Nevertheless, the prototyping process remains similar, and so do the development stages. The translation between AVS and SynDEx is identical: the modification lies only in the generation, by SynDEx, of the distributed executive for the target platform.

The LAR application has been reimplemented on a one-DSP architecture, and the timing reports are presented in the right-hand column of Table 1. The sequential execution time is 245.331 milliseconds on a C6x DSP, that is, a rough acceleration factor of 13 obtained only by integrating newer and faster processors.

Several observations can be made:

(i) Most of the software functions are faster on the C6x DSP. So, without changing the rapid prototyping process, the implementation time can be improved simply by integrating new and more efficient components.

(ii) The execution time of the operations initially implemented in hardware (i.e., Er3∗3 and Dil3∗3) is not improved by a software implementation, and the hardware integration remains the best solution. A mixed DSP-FPGA architecture will always be an efficient platform for the implementation of digital image processing with real-time constraints.

Figure 14: The timing diagram generated by SynDEx. The best implementation uses only two DSPs (root and P1): Er3∗3 is handled by the P1 processor and Dil3∗3 by the root processor, and these functions are executed at the same time. The root processor executes most of the other functions. SynDEx indicates (on top) that the efficiency is 1.8 compared to a sequential executive.

5.5. Results

For this application, the execution time is 3.21 seconds for a one-DSP implementation and 245 milliseconds for the mixed multi-DSP + FPGA architecture (an acceleration factor of about 13).

Our methodology ensures a fast and controlled prototyping process, and a final optimized implementation.

The development time of such applications (Table 2) is estimated at one day (when different scenarios are tested) with AVSynDEx and its automatic translator, and at 3–4 days without it. This estimation is based on the hypothesis that there is at least one mistake in the development process, and the times include the detection and correction of this mistake. The estimations result from our own implementations of complex image processing applications, combined with the experience of designers working in the same laboratory, all of whom have extensive experience with the AVS environment.

The main work consists of describing the application under the AVS environment and creating the new C-modules. The implementation process is very fast and secure: the timing stage takes about 20 minutes, from the automatic generation of the sequential executive to the final timing results. Without the automatic translator, the generation of the SynDEx data flow graph takes about 1 hour; this is an average, as it depends on the application size (number of vertices). The rest of the implementation process is very fast (15 minutes).

6. CONCLUSION AND PERSPECTIVES

We have presented AVSynDEx, a rapid prototyping process able to implement complex signal/image applications on a multi-DSP + FPGA platform. AVSynDEx is currently the only environment able to target this kind of architecture from a high-level functional description. The methodology integrates two CAD tools: AVS for the functional development of the application, described as a static data flow graph, and SynDEx as the generator of an optimized distributed executive. SynDEx is a powerful tool for finding the best match between an application and a specific architecture, but it does not constitute an algorithm development environment. Adding a front-end tool and developing an automatic link between the two introduces a higher level of abstraction into the process. Moreover, SynDEx can only handle processors, not dedicated hardware (even if some work in this direction is in progress). By selecting a suitable FPGA-based target and adapting its management to the SynDEx description type, we have removed this limitation. The result is a fast and easy-to-use process: the image designer can develop and supervise the whole implementation process without any prerequisite, as the various complex and specific stages become transparent.

A main characteristic of this process is its openness: it is very easy to use other CAD tools or to update the environments used. The structure of the methodology allows the AVS environment to be replaced by other graphical application development tools such as Ptolemy; the application description of the new tool just has to present semantics similar to those of SynDEx (a static data flow graph). The target platform itself is also flexible: the number of FPGAs and DSPs it integrates is not fixed.

We offer a low-cost solution, considering that the front-end environment is in any case necessary for the high-level development of image applications and that SynDEx is a free academic tool. Moreover, the prototyping targets relatively cheap platforms among the existing multicomponent architectures.

Work in progress concerns the integration of the new versions of AVS and SynDEx (introducing the notion of a dynamic data flow graph), as well as the interface to a new architecture based on several TI C6x DSPs and Virtex FPGAs. In particular, we are developing a new SynDEx executive kernel for these DSPs. At the same time, we are developing an MPEG-4 coder with AVS, which should be integrated into the new platform thanks to AVSynDEx in order to reach real-time performance. An MPEG-2 coder has already been developed and implemented on the multi-C4x-FPGA platform [24].

Another perspective is the integration of new tools such as ArtBuilder [25] or the DK1 Design Suite [26] to facilitate the creation of new FPGA cores by hardware nonspecialists. These tools can generate VHDL code or the target core directly, starting from a C description or a C-like description such as Handel-C for the DK1 Design Suite.

REFERENCES

[1] A. Downton and D. Crookes, “Parallel architecture for image processing,” Electronics & Communication Engineering Journal, vol. 10, no. 3, pp. 139–151, June 1998.

[2] N. M. Allinson, N. J. Howard, A. R. Kolcz, et al., “Image processing applications using a novel parallel computing machine based on reconfigurable logic,” in IEE Colloquium on Parallel Architectures for Image Processing, pp. 2/1–2/7, 1994.

[3] G. Quenot, C. Coutelle, J. Serot, and B. Zavidovique, “Implementing image processing applications on a real-time architecture,” in Proc. Computer Architectures for Machine Perception, pp. 34–42, New Orleans, La, USA, December 1993.

[4] Q. Wang and S. G. Ziavras, “Powerful and feasible processor interconnections with an evaluation of their communications capabilities,” in Proc. 4th International Symposium on Algorithms and Networks, pp. 222–227, Fremantle, Australia, June 1999.

[5] M. Makhaniok and R. Manner, “Hardware synchronization of massively parallel processes in distributed systems,” in Proc. 3rd International Symposium on Parallel Architectures, Algorithms and Networks, pp. 157–164, Taipei, Taiwan, December 1997.

[6] G. Koch, U. Kebschull, and W. Rosenstiel, “A prototyping environment for hardware/software codesign in the COBRA project,” in Proc. 3rd International Workshop on Hardware/Software Codesign, pp. 10–16, Grenoble, France, September 1994.

[7] B. K. Seljak, “Hardware-software co-design for a real-time executive,” in Proc. IEEE International Symposium on Industrial Electronics, vol. 1, pp. 55–58, Bled, Slovenia, 1999.

[8] R. K. Gupta, “Hardware-software co-design: Tools for architecting systems-on-a-chip,” in Proc. Design Automation Conference, pp. 285–289, Makuhari, Japan, January 1997.

[9] F. Balarin, M. Chiodo, D. Engels, et al., “POLIS: A Design Environment for Control-Dominated Embedded Systems, version 3.0,” User’s manual, December 1997.

[10] Department of Computer Science and Engineering, “The Chinook project,” Tech. Rep., University of Washington, Seattle, Wash, USA, May 1998, http://cs.washington.edu/research/chinook/.

[11] Advanced Visual Systems Inc., “Introduction to AVS/Express,”Official site http://www.avs.com, 1996.

[12] R. O. Cleaver and S. F. Midkiff, “Visualization of network performance using the AVS visualization system,” in Proc. 2nd International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pp. 407–408, Durham, NC, USA, 31 January–2 February 1994.

[13] C. Lavarenne, O. Seghrouchni, Y. Sorel, and M. Sorine, “The SynDEx software environment for real-time distributed systems design and implementation,” in Proc. European Control Conference, pp. 1684–1689, Grenoble, France, July 1991.

[14] C. Lavarenne and Y. Sorel, “Specification, performance optimization and executive generation for real-time embedded multiprocessor applications with SynDEx,” in CNES Symposium on Real-Time Embedded Processing for Space Applications, Les Saintes Maries de la Mer, France, November 1992.

[15] C. Lavarenne and Y. Sorel, “Real time embedded image processing applications using the A3 methodology,” in Proc. IEEE International Conference on Image Processing, pp. 145–148, Lausanne, Switzerland, November 1996.

[16] T. Grandpierre, C. Lavarenne, and Y. Sorel, “Optimized rapid prototyping for real-time embedded heterogeneous multiprocessors,” in Proc. 7th International Workshop on Hardware/Software Co-Design, pp. 74–78, Rome, Italy, May 1999.

[17] A. Vicard and Y. Sorel, “Formalization and static optimization of parallel implementations,” in Workshop on Distributed and Parallel Systems, Budapest, Hungary, September 1998.

[18] Sundance Inc., “SMT320 4 slots TIM,” http://www.sundance.com/s320.htm, 2000.

[19] Sundance Inc., “SMT314 video grab and display TMS320C40 TIM,” http://www.sundance.com/s314.htm, 2000.

[20] J. Lienard and G. Lejeune, “Mustig: a simulation tool in front of the SynDEx software,” in Thematic Days University-Industry, GRAISyHM-AAA-99, pp. 34–39, Lille, France, March 1999.

[21] V. Fresse, R. Berbain, and O. Deforges, “Ptolemy as front end tool for fast prototyping into parallel and mixed architecture,” in International Conference on Signal Processing Applications and Technology, Dallas, Tex, USA, October 2000.

[22] O. Deforges and J. Ronsin, “Nonuniform sub-sampling using square elements: a fast still image coding at low bit rate,” in International Picture Coding Symposium, Portland, Ore, USA, April 1999.

[23] H. Minkowski, “Volumen und Oberfläche,” Math. Ann., vol. 57, pp. 447–495, 1903.

[24] J. F. Nezan, V. Fresse, and O. Deforges, “Fast prototyping of parallel architectures: an MPEG-2 coding application,” in The 2001 International Conference on Imaging Science, Systems, and Technology, Las Vegas, Nev, USA, June 2001.

[25] M. Fleury, R. P. Self, and A. C. Downton, “Hardware compilation for software engineers: an ATM example,” IEE Proceedings - Software, vol. 148, no. 1, pp. 31–42, 2001.

[26] T. Stockein and J. Basig, “Handel-C: an effective method for designing FPGA (and ASIC),” Academic paper, University of Applied Sciences, Nuremberg, 2001, http://www.celoxica.com/products/technical_papers/index.htm.

Virginie Fresse received the Ph.D. degree in electronics from the Institute of Applied Sciences of Rennes (INSA), France, in 2001. She is currently a postdoctoral researcher in the Department of Electrical Engineering at the University of Strathclyde, Glasgow, Scotland. Her research interests include the implementation of real-time image-processing applications on parallel and mixed architectures, the development of rapid prototyping processes, and codesign methodologies.

Olivier Deforges graduated in electronic engineering from the Polytechnic University of Nantes, France, in 1992, where he also received a Ph.D. degree in image processing in 1995. Since September 1996, he has been a lecturer in the Department of Electronic Engineering at the INSA Rennes Scientific and Technical University. He is a member of the UMR CNRS 6164 IETR laboratory in Rennes. His principal research interests are parallel architectures, image understanding, and compression.

Jean-Francois Nezan received his postgraduate certificate in signal, telecommunications, images, and radar sciences from Rennes University in 1999, and his MSI in electronic and computer engineering from the INSA Rennes Scientific and Technical University in 1999, where he is currently working toward a Ph.D. degree. His research interests include image compression algorithms and rapid prototyping.