Domain-Specific Processors: Systems, Architectures, Modeling, and Simulation. S. Bhattacharyya et al., Marcel Dekker, 2004.

Domain-Specific Processors: Systems, Architectures, Modeling, and Simulation

edited by

Shuvra S. Bhattacharyya, University of Maryland, College Park, Maryland, U.S.A.
Ed F. Deprettere, Leiden University, Leiden, The Netherlands
Jürgen Teich, University of Erlangen-Nuremberg, Nuremberg, Germany

Copyright 2004 by Marcel Dekker, Inc. All Rights Reserved.

Although great care has been taken to provide accurate and current information, neither the author(s) nor the publisher, nor anyone else associated with this publication, shall be liable for any loss, damage, or liability directly or indirectly caused or alleged to be caused by this book. The material contained herein is not intended to provide specific advice or recommendations for any specific situation.

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data: A catalog record for this book is available from the Library of Congress.

ISBN: 0-8247-4711-9

This book is printed on acid-free paper.

Headquarters: Marcel Dekker, Inc., 270 Madison Avenue, New York, NY 10016, U.S.A. (tel: 212-696-9000; fax: 212-685-4540)
Distribution and Customer Service: Marcel Dekker, Inc., Cimarron Road, Monticello, New York 12701, U.S.A. (tel: 800-228-1160; fax: 845-796-1772)
Eastern Hemisphere Distribution: Marcel Dekker AG, Hutgasse 4, Postfach 812, CH-4001 Basel, Switzerland (tel: 41-61-260-6300; fax: 41-61-260-6333)
World Wide Web: http://www.dekker.com

The publisher offers discounts on this book when ordered in bulk quantities. For more information, write to Special Sales/Professional Marketing at the headquarters address above.
Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without permission in writing from the publisher.

Current printing (last digit): 10 9 8 7 6 5 4 3 2 1

PRINTED IN THE UNITED STATES OF AMERICA

Signal Processing and Communications

Editorial Board:
Maurice G. Bellanger, Conservatoire National des Arts et Métiers (CNAM), Paris
Ezio Biglieri, Politecnico di Torino, Italy
Sadaoki Furui, Tokyo Institute of Technology
Yih-Fang Huang, University of Notre Dame
Nikhil Jayant, Georgia Tech University
Aggelos K. Katsaggelos, Northwestern University
Mos Kaveh, University of Minnesota
P. K. Raja Rajasekaran, Texas Instruments
John Aasted Sorenson, IT University of Copenhagen

1. Digital Signal Processing for Multimedia Systems, edited by Keshab K. Parhi and Takao Nishitani
2. Multimedia Systems, Standards, and Networks, edited by Atul Puri and Tsuhan Chen
3. Embedded Multiprocessors: Scheduling and Synchronization, Sundararajan Sriram and Shuvra S. Bhattacharyya
4. Signal Processing for Intelligent Sensor Systems, David C. Swanson
5. Compressed Video over Networks, edited by Ming-Ting Sun and Amy R. Reibman
6. Modulated Coding for Intersymbol Interference Channels, Xiang-Gen Xia
7. Digital Speech Processing, Synthesis, and Recognition: Second Edition, Revised and Expanded, Sadaoki Furui
8. Modern Digital Halftoning, Daniel L. Lau and Gonzalo R. Arce
9. Blind Equalization and Identification, Zhi Ding and Ye (Geoffrey) Li
10. Video Coding for Wireless Communication Systems, King N. Ngan, Chi W. Yap, and Keng T. Tan
11. Adaptive Digital Filters: Second Edition, Revised and Expanded, Maurice G. Bellanger
12. Design of Digital Video Coding Systems, Jie Chen, Ut-Va Koc, and K. J. Ray Liu
13. Programmable Digital Signal Processors: Architecture, Programming, and Applications, edited by Yu Hen Hu
14. Pattern Recognition and Image Preprocessing: Second Edition, Revised and Expanded, Sing-Tze Bow
15. Signal Processing for Magnetic Resonance Imaging and Spectroscopy, edited by Hong Yan
16. Satellite Communication Engineering, Michael O. Kolawole
17. Speech Processing: A Dynamic and Optimization-Oriented Approach, Li Deng and Douglas O'Shaughnessy
18. Multidimensional Discrete Unitary Transforms: Representation, Partitioning, and Algorithms, Artyom M. Grigoryan and Sos A. Agaian
19. High-Resolution and Robust Signal Processing, Yingbo Hua, Alex B. Gershman, and Qi Cheng
20. Domain-Specific Embedded Multiprocessors: Systems, Architectures, Modeling, and Simulation, Shuvra Bhattacharyya, Ed Deprettere, and Jürgen Teich

Additional Volumes in Preparation:
Biosignal and Biomedical Image Processing: MATLAB-Based Applications, John L. Semmlow
Watermarking Systems Engineering: Enabling Digital Assets Security and Other Applications, Mauro Barni and Franco Bartolini
Image Processing Technologies: Algorithms, Sensors, and Applications, Kiyoharu Aizawa, Katsuhiko Sakaue, Yasuhito Suenaga

Series Introduction

Over the past 50 years, digital signal processing has evolved as a major engineering discipline. The fields of signal processing have grown from the origin of fast Fourier transform and digital filter design to statistical spectral analysis and array processing, image, audio, and multimedia processing, and shaped developments in high-performance VLSI signal processor design. Indeed, there are few fields that enjoy so many applications: signal processing is everywhere in our lives. When one uses a cellular phone, the voice is compressed, coded, and modulated using signal processing techniques. As a cruise missile winds along hillsides searching for the target, the signal processor is busy processing the images taken along the way.
When we are watching a movie in HDTV, millions of audio and video data are being sent to our homes and received with unbelievable fidelity. When scientists compare DNA samples, fast pattern recognition techniques are being used. On and on, one can see the impact of signal processing in almost every engineering and scientific discipline. Because of the immense importance of signal processing and the fast-growing demands of business and industry, this series on signal processing serves to report up-to-date developments and advances in the field. The topics of interest include but are not limited to the following:

Signal theory and analysis
Statistical signal processing
Speech and audio processing
Image and video processing
Multimedia signal processing and technology
Signal processing for communications
Signal processing architectures and VLSI design

We hope this series will provide the interested audience with high-quality, state-of-the-art signal processing literature through research monographs, edited books, and rigorously written textbooks by experts in their fields.

Preface

Due to the rapidly increasing complexity and heterogeneity of embedded systems, a single expert can no longer be master of all trades. The era in which an individual could take care of all aspects (functional as well as nonfunctional) of specification, modeling, performance/cost analysis, exploration, and verification in embedded systems and software design will be over soon. Future embedded systems will have to rely on concurrency and parallelism to satisfy performance and cost constraints that go with the complexity of applications and architectures. Thus, an expert is familiar with and feels comfortable at only a few adjacent levels of abstraction, while the number of abstraction levels in between a specification and a system implementation is steadily increasing. But even at a single level of abstraction, experts will most likely have good skills only in either computation- or communication-related issues, as a result of which the notion of separation of concerns will become crucial. These observations have far-reaching consequences. One of them is that new design methodologies must be devised in which the notions of levels of abstraction and separation of concerns have grown into fundamental concepts and methods.

An early view on abstraction levels is represented by the Y-chart introduced by Gajski and Kuhn [1]. This chart gives three model views (behavioral, structural, and physical), showing levels of abstraction across which refinements take place. A more recent view on levels of abstraction and the relation between behavior and structure on these levels is reflected in the Abstraction Pyramid and the Y-chart approach introduced by Kienhuis
et al. [2]. The Y-chart approach obeys a separation-of-concerns principle by making algorithms, architecture, and mapping manifest to permit quantification of choices. When looking more closely at the Gajski Y-chart and the Kienhuis Y-chart approaches, one sees that the approach is invariant to the levels of abstraction: on each level there is some architecture or component that is specified in terms of some model(s) of architecture, there are one or more applications that are specified in terms of some model(s) of computation, and there are mapping methods that take components of the application model to components of the architecture model. The Y-chart, on the other hand, clearly reveals that models and methods will be different on each level of abstraction: refinements take place when going down the abstraction levels, as a consequence of which the design space is narrowed down step by step until only a few (Pareto-optimal) designs remain.

Another consequence of the larger number of abstraction levels and the increasing amount of concurrency is that the higher the level of abstraction, the larger the dimension and size of the design space. To keep such a relation manageable, it is necessary to introduce parametrized architectures or templates that can be instantiated to architectures. The types of the parameters depend on the level of abstraction. For example, on the highest level, a method of synchronization and the number of a particular type of computing element may be parameters. Often, the template itself is a version of a platform. A platform is application-domain-specific and has to be defined through a domain analysis. For example, platforms in the automotive application domain are quite different from platforms in the multimedia application domain.
In the former, platforms must match codesign finite state machine models [3], while in the latter they will have to support dataflow network or process network models [4]. Roughly speaking, a platform consists of two parts: one that concerns processing elements, and one that encompasses a communication and storage infrastructure. This partitioning is compliant with the computation vs. communication separation-of-concerns rule [5]. The processing elements are taken from a library, often as intellectual property components, and the communication and storage infrastructure obeys certain predefined construction and operation rules. Specifying a platform is still more of an art than a science.

How could a sound methodology be designed to overcome the many problems that led to the paradigm change in embedded systems and software design? There is currently no compelling answer to that question. An increasing number of research groups all over the globe are proposing and validating prototype methodologies, most of which are embedded in the SystemC framework. An increasing number of them are advocating a platform-based design approach that relies heavily on models and methods that can support two major decision steps: exploration on a particular level of abstraction to prune the design space, and decomposition and composition to move down and up the abstraction hierarchy. The view that currently seems to prevail is a three-layer approach: an application layer, an architecture layer, and a mapping layer. The application layer and the architecture layer bear models of application(s) and the architecture, respectively, that match in the sense that a mapping of the former onto the latter is transparent. The mapping layer transforms application models into architecture models.
Present approaches differ in the way the mapping of applications onto an architecture is conceived. One approach is to refine the application model to match the architecture model in such a way that only a system model, i.e., an implementation of the application, is to be dealt with when it comes to performance/cost analysis or exploration. Another approach is to adhere strictly to the separation-of-concerns principles, implying that application models and architecture models are strictly separated. In this case, the mapping layer consists of a number of transformations that convert representations of components of the application model to representations of components of the architecture model. For example, a process in an application modeled as a process network can be represented by a control data flow graph (symbolically at higher levels of abstraction [6] and executable at lower levels of abstraction) and transformed subsequently in the mapping layer to come closer to the architecture processing unit model that supports various execution, synchronization, and storage flavors.

This book offers a dozen essential contributions on various levels of abstraction appearing in embedded systems and software design. They range from low-level application and architecture optimizations to high-level modeling and exploration concerns, as well as specializations in terms of applications, architectures, and mappings.

The first chapter, by Walters, Glossner, and Schulte, presents a rapid prototyping tool that generates structural VHDL specifications of FIR filters. The object-oriented design of the tool facilitates extensions to it that incorporate new optimizations and techniques. The authors apply their VHDL generation tool to investigate the use of truncated multipliers in FIR filter implementations, and demonstrate in this study significant area improvements with relatively small computational error.

The next chapter, by Lagadec, Pottier, and Villellas-Guillen, presents a tool for generating combinational FPGA circuits from symbolic specifications.
The tool operates on an intermediate representation that is based on a graph of lookup tables, which is translated into a logic graph to be mapped onto hardware. A case study of a Reed-Solomon RAID coder/decoder generator is used to demonstrate the proposed techniques.

The third chapter, by Guevorkian, Liuha, Launiainen, and Lappalainen, develops several architectures for discrete wavelet transforms based on their flow graph representation. Scalability of the architectures is demonstrated in trading off hardware complexity and performance. The architectures are also shown to exhibit efficient performance, regularity, modularity, and amenability to semisystolic array implementation.

Chapter 4, by Takala and Järvinen, develops the concept of stride permutation for interleaved memory systems in embedded applications. The relevance of stride permutation to several DSP benchmarks is demonstrated, and a technique is developed for conflict-free stride permutation access under certain assumptions regarding the layout of data.

The next chapter, by Pimentel, Terpstra, Polstra, and Coffland, discusses techniques used for capturing intratask parallelism during simulation in the Sesame environment for design space exploration. These techniques are based on systematic refinement of processor models into parallel functional units, and a dataflow-based synchronization mechanism. A case study of QR decomposition is used to validate the results.

The next chapter, by Hannig and Teich, develops an approach for power modeling and energy estimation of piecewise regular processor arrays. The approach is based on exploiting the large reduction in power consumption for constant inputs to functional units. An efficient algorithm is also presented for computing energy-optimal space-time mappings.

Chapter 7, by Derrien, Guillou, Quinton, Risset, and Wagner, presents an interface synthesis tool for regular architectures.
Safety of the synthesized designs is assured through joint hardware/software synthesis from a common specification. Efficiency of the generated interfaces is demonstrated through experiments with DLMS filter implementation on a Spyder FPGA board.

In Chapter 8, by Lohani and Bhattacharyya, a model is developed for executing applications with time-varying performance requirements and nondeterministic execution times on architectures with reconfigurable computation and communication structures. Techniques are developed to reduce the complexity of various issues related to the model, and a heuristic framework is developed for efficiently guiding the process of runtime adaptation.

The next chapter, by Turjan, Kienhuis, and Deprettere, develops and compares alternative realizations of the extended linearization model, which is an approach for reordering tokens when interprocess data arrives out of order during the execution of a Kahn process network (KPN). The model involves augmenting Kahn processes with additional memory and a controller in a manner that preserves KPN semantics. The alternative realizations are compared along various dimensions, including memory requirements and computational complexity.

The next chapter, by Radulescu and Goossens, compares networks-on-chip with off-chip networks and existing on-chip interconnects. Network-on-chip services are defined and a transaction model is developed to facilitate migration to the new communication architecture. Properties of connections under the proposed network-on-chip framework include transaction completion, transaction orderings, performance bounds, and flow control.

Chapter 11, by Stravers and Hoogerbrugge, presents an architecture and programming model for single-chip multiprocessing with high power/performance efficiency. The architecture is based on homogeneous clusters of processors, called tiles, that form the boundary between static and dynamic resource allocation.
Case studies of an MPEG-2 decoder are used to demonstrate the proposed ideas.

The last chapter, by Wong, Vassiliadis, and Cotofana, reviews the evolution of embedded processor characteristics in relation to programmability and reconfigurability. A case for embedded architectures that integrate programmable processors and reconfigurable hardware is developed, and a specific approach is described for achieving this by means of microcode.

We would like to thank all the chapter authors for their outstanding contributions. We also thank, at Marcel Dekker Inc., B. J. Clark for his encouragement to develop this book and Brian Black for his help with the production process. Thanks also to all the reviewers for their hard work in helping to ensure the quality of the book.

Shuvra S. Bhattacharyya
Ed Deprettere
Juergen Teich

REFERENCES

1. Gajski, D. (1987). Silicon Compilers. Addison-Wesley.
2. Kienhuis, B., Deprettere, E. F., van der Wolf, P., Vissers, K. (2002). A methodology to design embedded systems: the Y-chart approach. In: Deprettere, E. F., Teich, J., Vassiliadis, S., eds. Embedded Processor Design Challenges, Lecture Notes in Computer Science. Springer.
3. Balarin, F., et al. (1997). Hardware-Software Co-Design of Embedded Systems: The Polis Approach. Kluwer Academic Publishers.
4. Kienhuis, B., Deprettere, E. F., Vissers, K., van der Wolf, P. (July 1997). An approach for quantitative analysis of application-specific dataflow architectures. In: Proceedings of the International Conference on Application Specific Systems, Architectures, and Processors.
5. Keutzer, K., Malik, S., Newton, R., Rabaey, J., Sangiovanni-Vincentelli, A. (December 2000). System-level design: orthogonalization of concerns and platform-based design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
6. Zivkovic, V., et al. (1999). Fast and accurate multiprocessor exploration with symbolic programs. In: Proceedings of the Design, Automation and Test in Europe Conference.

Contents

Series Introduction
Preface

1. Automatic VHDL Model Generation of Parameterized FIR Filters
   E. George Walters III, John Glossner, and Michael J. Schulte
2. An LUT-Based High Level Synthesis Framework for Reconfigurable Architectures
   Loïc Lagadec, Bernard Pottier, and Oscar Villellas-Guillen
3. Highly Efficient Scalable Parallel-Pipelined Architectures for Discrete Wavelet Transforms
   David Guevorkian, Petri Liuha, Aki Launiainen, and Ville Lappalainen
4. Stride Permutation Access in Interleaved Memory Systems
   Jarmo Takala and Tuomas Järvinen
5. On Modeling Intra-Task Parallelism in Task-Level Parallel Embedded Systems
   Andy D. Pimentel, Frank P. Terpstra, Simon Polstra, and Joe E. Coffland
6. Energy Estimation and Optimization for Piecewise Regular Processor Arrays
   Frank Hannig and Jürgen Teich
7. Automatic Synthesis of Efficient Interfaces for Compiled Regular Architectures
   Steven Derrien, Anne-Claire Guillou, Patrice Quinton, Tanguy Risset, and Charles Wagner
8. Goal-Driven Reconfiguration of Polymorphous Architectures
   Sumit Lohani and Shuvra S. Bhattacharyya
9. Realizations of the Extended Linearization Model
   Alexandru Turjan, Bart Kienhuis, and Ed F. Deprettere
10. Communication Services for Networks on Chip
    Andrei Radulescu and Kees Goossens
11. Single-Chip Multiprocessing for Consumer Electronics
    Paul Stravers and Jan Hoogerbrugge
12. Future Directions of Programmable and Reconfigurable Embedded Processors
    Stephan Wong, Stamatis Vassiliadis, and Sorin Cotofana

Contributors

Shuvra S. Bhattacharyya, University of Maryland at College Park, College Park, Maryland
Joe E. Coffland, Department of Computer Science, University of Amsterdam, Amsterdam, The Netherlands
Sorin Cotofana, Computer Engineering Laboratory, Electrical Engineering Department, Delft University of Technology, Delft, The Netherlands
Ed F. Deprettere, Leiden Embedded Research Center, Leiden University, Leiden, The Netherlands
Steven Derrien, Irisa, Campus de Beaulieu, Rennes, France
John Glossner, Sandbridge Technologies, White Plains, New York, U.S.A.
Kees Goossens, Philips Research Laboratories, Eindhoven, The Netherlands
David Guevorkian, Nokia Research Center, Tampere, Finland
Anne-Claire Guillou, Irisa, Campus de Beaulieu, Rennes, France
Frank Hannig, University of Paderborn, Paderborn, Germany
Jan Hoogerbrugge, Philips Research Laboratories, Eindhoven, The Netherlands
Tuomas Järvinen, Tampere University of Technology, Tampere, Finland
Bart Kienhuis, Leiden Embedded Research Center, Leiden University, Leiden, The Netherlands
Loïc Lagadec, Université de Bretagne Occidentale, UFR Sciences, Brest, France
Ville Lappalainen, Nokia Research Center, Tampere, Finland
Aki Launiainen, Nokia Research Center, Tampere, Finland
Petri Liuha, Nokia Research Center, Tampere, Finland
Sumit Lohani, University of Maryland at College Park, College Park, Maryland
Andy D. Pimentel, Department of Computer Science, University of Amsterdam, Amsterdam, The Netherlands
Bernard Pottier, Université de Bretagne Occidentale, UFR Sciences, Brest, France
Simon Polstra, Department of Computer Science, University of Amsterdam, Amsterdam, The Netherlands
Patrice Quinton, Irisa, Campus de Beaulieu, Rennes, France
Andrei Radulescu, Philips Research Laboratories, Eindhoven, The Netherlands
Tanguy Risset, LIP, ENS Lyon, France
Michael J. Schulte, ECE Department, University of Wisconsin-Madison, Madison, Wisconsin, U.S.A.
Paul Stravers, Philips Research Laboratories, Eindhoven, The Netherlands
Jarmo Takala, Tampere University of Technology, Tampere, Finland
Jürgen Teich, University of Paderborn, Paderborn, Germany
Frank P. Terpstra, Department of Computer Science, University of Amsterdam, Amsterdam, The Netherlands
Alexandru Turjan, Leiden Embedded Research Center, Leiden University, Leiden, The Netherlands
Stamatis Vassiliadis, Computer Engineering Laboratory, Electrical Engineering Department, Delft University of Technology, Delft, The Netherlands
Oscar Villellas-Guillen, Université de Bretagne Occidentale, UFR Sciences, Brest, France
Charles Wagner, Irisa, Campus de Beaulieu, Rennes, France
E. George Walters III, CSE Department, Lehigh University, Bethlehem, Pennsylvania, U.S.A.
Stephan Wong, Computer Engineering Laboratory, Electrical Engineering Department, Delft University of Technology, Delft, The Netherlands

1
Automatic VHDL Model Generation of Parameterized FIR Filters

E. George Walters III, CSE Department, Lehigh University, Bethlehem, Pennsylvania, U.S.A.
John Glossner, Sandbridge Technologies, White Plains, New York, U.S.A.
Michael J. Schulte, ECE Department, University of Wisconsin-Madison, Madison, Wisconsin, U.S.A.

I. INTRODUCTION

Designing hardware accelerators for embedded systems presents many tradeoffs that are difficult to quantify without bit-accurate simulation and area and delay estimates of competing alternatives. Structural-level VHDL models can be used to evaluate and compare designs, but require significant effort to generate.

This chapter presents a tool that was developed to evaluate the tradeoffs involved in using truncated multipliers in FIR filter hardware accelerators. The tool is based on a package of Java classes that model the building blocks of computational systems, such as adders and multipliers. These classes generate VHDL descriptions, and are used by other classes in hierarchical fashion to generate VHDL descriptions of more complex systems. This chapter describes the generation of truncated FIR filters as an example.

Previous techniques for modeling and designing digital signal processing systems with VHDL were presented in references [1-5].
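The hierarchical, class-based generation scheme just described can be sketched in a few lines. This is a hedged illustration only: Python stands in for the tool's Java classes, and every class name, port name, and line of emitted VHDL below is an assumption for demonstration, not the actual API of the authors' tool.

```python
# Illustrative sketch (NOT the authors' Java package): building-block
# objects that emit VHDL text, composed hierarchically by a filter class.

class Component:
    """Base class for anything that can render itself as VHDL."""
    def __init__(self, name):
        self.name = name

    def vhdl(self):
        raise NotImplementedError


class Adder(Component):
    """A width-bit adder entity; only the entity declaration is sketched."""
    def __init__(self, name, width):
        super().__init__(name)
        self.width = width

    def vhdl(self):
        w = self.width - 1
        return (f"entity {self.name} is\n"
                f"  port (a, b : in  std_logic_vector({w} downto 0);\n"
                f"        s    : out std_logic_vector({w} downto 0));\n"
                f"end {self.name};\n")


class FirFilter(Component):
    """Higher-level generator: instantiates one adder per summation step.
    A real generator would also emit the structural netlist wiring them."""
    def __init__(self, name, taps, width):
        super().__init__(name)
        self.parts = [Adder(f"{name}_add{i}", width) for i in range(taps - 1)]

    def vhdl(self):
        # Hierarchical generation: concatenate each sub-component's VHDL.
        return "\n".join(p.vhdl() for p in self.parts)


print(FirFilter("fir4", taps=4, width=16).vhdl())
```

Subclassing `Adder` (say, with a truncated-multiplier variant) would slot new optimizations into the same generation flow, which is the OOP benefit the chapter emphasizes.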
The tool described in this chapter differs from those techniques by leveraging the benefits of object-oriented programming (OOP). By subclassing existing objects, such as multipliers, the tool is easily extended to generate VHDL models that incorporate the latest optimizations and techniques.

Subsections A and B provide the background necessary for understanding the two's complement truncated multipliers used in the FIR filter architecture, which is described in Section II. Section III describes the tool for automatically generating VHDL models of those filters. Synthesis results of specific filter implementations are presented in Section IV, and concluding remarks are given in Section V.

A. Two's Complement Multipliers

Parallel tree multipliers form a matrix of partial product bits which are then added to produce a product. Consider an m-bit multiplicand A and an n-bit multiplier B. If A and B are integers in two's complement form, then

    A = -a_{m-1} 2^{m-1} + \sum_{i=0}^{m-2} a_i 2^i  and  B = -b_{n-1} 2^{n-1} + \sum_{j=0}^{n-2} b_j 2^j    (1)

Multiplying A and B together yields the following expression:

    A \cdot B = a_{m-1} b_{n-1} 2^{m+n-2} + \sum_{i=0}^{m-2} \sum_{j=0}^{n-2} a_i b_j 2^{i+j} - \sum_{i=0}^{m-2} a_i b_{n-1} 2^{i+n-1} - \sum_{j=0}^{n-2} a_{m-1} b_j 2^{j+m-1}    (2)

The first two terms in Eq. (2) are positive. The third term is either zero (if b_{n-1} = 0) or negative with a magnitude of \sum_{i=0}^{m-2} a_i 2^{i+n-1} (if b_{n-1} = 1). Similarly, the fourth term is either zero or a negative number. To produce the product A \cdot B, the first two terms are added as is. Since the third and fourth terms are negative (or zero), they are added by complementing each bit, adding 1 to the LSB column, and sign extending with a leading 1. With these substitutions, the product is computed without any subtractions as

    P = a_{m-1} b_{n-1} 2^{m+n-2} + \sum_{i=0}^{m-2} \sum_{j=0}^{n-2} a_i b_j 2^{i+j} + \sum_{i=0}^{m-2} \overline{a_i b_{n-1}} \, 2^{i+n-1} + \sum_{j=0}^{n-2} \overline{a_{m-1} b_j} \, 2^{j+m-1} + 2^{m+n-1} + 2^{n-1} + 2^{m-1}    (3)

Figure 1 shows the multiplication of two 8-bit integers in two's complement form. The partial product bit matrix is described by Eq. (3), and is implemented using an array of AND and NAND gates. The matrix is then reduced using techniques such as Wallace [6], Dadda [7], or reduced area reduction [8].

Figure 1: 8 x 8 partial product bit matrix (two's complement).

B. Truncated Multipliers

Truncated m x n multipliers, which produce results less than m + n bits long, are described in [9]. Benefits of truncated multipliers include reduced area, delay, and power consumption [10]. An overview of truncated multipliers, which discusses several methods for correcting the error introduced due to unformed partial product bits, is given in [11]. The method used in this chapter is constant correction, as described in [9].

Figure 2 shows an 8 x 8 truncated parallel multiplier with a correction constant added. The final result is l bits long. We define k as the number of truncated columns that are formed, and r as the number of columns that are not formed. In this example, the five least significant columns of partial product bits are not formed (l = 8, k = 3, r = 5). Truncation saves an AND gate for each bit not formed and eliminates the full adders and half adders that would otherwise be required to reduce them to two rows. The delay due to reducing the partial product matrix is not improved because the height of the matrix is unchanged. However, a shorter carry propagate adder is required, which may improve the overall delay of the multiplier.

The correction constant, C_r, and the 1 added for rounding are normally included in the reduction matrix. In Fig. 2 they are explicitly shown to make the concept more clear.

A consequence of truncation is that a reduction error is introduced due to the discarded bits. For simplicity, the operands are assumed to be integers, but the technique can also be applied to fractional or mixed number systems. With r unformed columns, the reduction error is

    E_r = \sum_{i=0}^{r-1} \sum_{j=0}^{i} a_{i-j} b_j 2^i    (4)

If A and B are random with a uniform probability density, then the average value of each partial product bit is 1/4, so the average reduction error is

    E_{r\_avg} = \frac{1}{4} \sum_{q=0}^{r-1} (q+1) 2^q = \frac{1}{4} \left[ (r-1) 2^r + 1 \right]    (5)

The correction constant, C_r, is chosen to offset E_{r\_avg} and is

    C_r = \mathrm{round}\left(2^{-r} E_{r\_avg}\right) 2^r = \mathrm{round}\left( \frac{(r-1) + 2^{-r}}{4} \right) 2^r    (6)

where round(x) indicates x is rounded to the nearest integer.
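As a concrete check of these formulations, the subtraction-free product of Eq. (3) and the correction constant of Eqs. (5)-(6) can be exercised at the bit level. This is a hedged Python sketch, not the chapter's VHDL or hardware: function names are illustrative, and Python integers stand in for the bit matrix.

```python
# Sketch: Eq. (3) two's complement product via AND/NAND partial products
# plus additive constants, and the constant-correction C_r of Eq. (6).

def bits(x, w):
    """LSB-first list of the w-bit two's complement pattern of x."""
    return [(x >> i) & 1 for i in range(w)]

def product_eq3(A, B, m, n):
    """A*B computed from the Eq. (3) bit matrix, without subtraction."""
    a, b = bits(A % (1 << m), m), bits(B % (1 << n), n)
    p = (a[m - 1] & b[n - 1]) << (m + n - 2)            # sign-by-sign term
    for i in range(m - 1):
        for j in range(n - 1):
            p += (a[i] & b[j]) << (i + j)               # AND gates
    for i in range(m - 1):
        p += (1 - (a[i] & b[n - 1])) << (i + n - 1)     # NAND gates
    for j in range(n - 1):
        p += (1 - (a[m - 1] & b[j])) << (j + m - 1)     # NAND gates
    p += (1 << (m + n - 1)) + (1 << (n - 1)) + (1 << (m - 1))  # constants
    p %= 1 << (m + n)                                   # wrap modulo 2^(m+n)
    return p - (1 << (m + n)) if p >> (m + n - 1) else p  # back to signed

def correction_constant(r):
    """C_r per Eq. (6), offsetting the mean error of r unformed columns."""
    e_avg = ((r - 1) * 2**r + 1) / 4                    # Eq. (5)
    return round(e_avg / 2**r) * 2**r                   # Eq. (6)
```

For the chapter's example of r = 5 unformed columns, Eq. (5) gives E_r_avg = 32.25 and `correction_constant(5)` yields 32, so the residual average error after correction is only 0.25 ulps of the unformed region.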
II. FIR FILTER ARCHITECTURE

This section describes the architecture used to study the effect of truncated multipliers in FIR filters. Little work has been published in this area, and this architecture incorporates the novel approach of combining all constants for two's complement multiplication and correction of reduction error into a single constant added just prior to computing the final filter output. This technique reduces the average reduction error of the filter by several orders of magnitude, when compared to the approach of including the constants directly in the multipliers. Subsection A presents an overview of the architecture, and subsection B describes components within the architecture.

Figure 2: 8 x 8 truncated multiplier with correction constant.

A. Architecture Overview

An FIR filter with T taps computes the following difference equation [12]:

    y[n] = \sum_{k=0}^{T-1} b[k] \, x[n-k]    (7)

where x[] is the input data stream, b[k] is the kth tap coefficient, and y[] is the output data stream of the filter. Since the tap coefficients and the impulse response, h[n], are related by

    h[n] = b[n] for n = 0, 1, ..., T-1, and h[n] = 0 otherwise    (8)

Eq. (7) can be recognized as the discrete convolution of the input stream with the impulse response [12].

Figure 3 shows the block diagram of the FIR filter architecture used in this chapter. This architecture has two data inputs, x_in and coeff, and one data output, y_out. There are two control inputs that are not shown, clk and loadtap. The input data stream enters at the x_in port. When the filter is ready to process a new sample, the data at x_in is clocked into the register labeled x[n] in the block diagram. The x[n] register is one of T shift registers, where T is the number of taps in the filter.
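The difference equation (7) has a direct behavioral reference model. The sketch below is plain Python, not the pipelined hardware; it simply evaluates the sum of products for each output sample, with x[n-k] taken as zero before the stream starts.

```python
# Behavioral reference for Eq. (7): y[n] = sum_{k=0}^{T-1} b[k] * x[n-k].

def fir(x, b):
    """Direct-form FIR: multiply the T most recent inputs by the tap
    coefficients and sum; inputs before the stream start are zero."""
    T = len(b)
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(T):
            if n - k >= 0:
                acc += b[k] * x[n - k]
        y.append(acc)
    return y
```

Feeding a unit impulse reproduces the tap coefficients as the output, which is exactly the relation h[n] = b[n] stated in Eq. (8): `fir([1, 0, 0, 0], [3, 2, 1])` returns `[3, 2, 1, 0]`.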
Whenx_inisclockedintothex[n]register, thevaluesintheotherregistersareshiftedright inthediagram,withtheoldestvalue,x[nT+1]beingdiscarded.Thetapcoecientsarestoredinanothersetofshiftregisters,labeledb[0]throughb[T1]inFig.3.Coecientsareloadedintotheregistersbyapplyingthecoecientvaluestothecoeaffportinsequenceandcyclingthe loadtap signal to load each one.Thelterispipelinedwithfourstages:operandselection, multiplica-tion,summation,andnaladdition.Operandselection. The number of multipliers inthe architecture iscongurable.ForalterwithTtapsandMmultipliers,eachmul-Copyright2004 by Marcel Dekker, Inc. All Rights Reserved.tiplierperforms[T/M] multiplicationsperinputsample. Theoper-ands for each multiplier are selected each clock cycle by an operandbusandclockedintoregisters.Multiplication. Each multiplier has two input operand registers,loaded by an operand bus in the previous stage. Each pair ofoperandsismultiplied,andthenaltworowsofthereductiontree(theproduct incarry-saveform)areclockedintoaregisterwheretheybecomeinputstothemulti-operandadderinthenext stage.Keepingthe result incarry-save form, rather thanusingacarrypropagateadder(CPA),reducestheoveralldelay.Summation.The multi-operand adder has carry-save inputs from eachmultiplier, as well as a carry-save input from the accumulator. Aftereach of the [T/M] multiplications have been performed, the output ofthe multi-operand adder (in carry-save form) is clocked into the CPAoperand register where it is added in the next pipeline stage.Final addition. In the nal stage, the carry-save vectors from the multi-operand adder and a correction constant are added by a specializedcarry-saveadderandacarry-propagateaddertoproduceasingleFigure3 ProposedFIRlterarchitecturewithTtapsandMmultipliers.Copyright2004 by Marcel Dekker, Inc. All Rights Reserved.result vector. The result is then clocked into an output register, whichis connected to the y_out output port of the filter.Theclksignalclocksthesystem. 
The clock period is set so that the multipliers and the multi-operand adder can complete their operation within one clock cycle. Therefore, ⌈T/M⌉ clock cycles are required to process each input sample. The final addition stage only needs to operate once per input sample, so it has ⌈T/M⌉ clock cycles to complete its calculation and is generally not on the critical path.

B. Architecture Components

This section discusses the components of the FIR filter architecture.

1. Multipliers

In this chapter, two's complement parallel tree multipliers are used to multiply the input data by the filter coefficients. When performing truncated multiplication, the constant correction method [9] is used. The output of each multiplier is the final two rows remaining after reduction of the partial product bits, which is the product in carry-save form [13]. Rounding does not occur at the multipliers; each product is (l + k) bits long. Including the extra k bits in the summation avoids an accumulation of roundoff errors. Rounding is done in the final addition stage.

As described in subsection A, the last three terms in Eq. (3) are constants. In this architecture, these constants are not included in the partial product matrix. Likewise, if using truncated multipliers, the correction constant is not included either. Instead, the constants for each multiplication are added in a single operation in the final addition stage of the filter. This is described later in more detail.

2. Multi-Operand Adder and Accumulator

As shown in Eq. (7), the output of an FIR filter is a sum of products. In this architecture, M products are computed per clock cycle. In each clock cycle, the carry-save outputs of each multiplier are added and stored in the accumulator register, also in carry-save form. The accumulator is included in the sum, except with the first group of products for a new input sample. This is
This isaccomplishedbyclearingtheaccumulatorwhentherstgroupofproductsarrivesattheinputtothemulti-operandadder.The multi-operand adder is simply a counter reduction tree, similar toa counter reduction tree fora multiplier,except that it begins withoperandbitsfromeachinputinsteadofapartialproductbitmatrix.TheoutputofCopyright2004 by Marcel Dekker, Inc. All Rights Reserved.themulti-operand adder is the nal two rows of bits remaining after reduc-tion, whichisthesumincarry-saveform. Thisoutputisclockedintotheaccumulator register every clock cycle, and clocked into the carry propagateadder(CPA)operandregisterevery[T/M]cycles.3. CorrectionConstantAdderAsstatedpreviously, theconstantsrequiredfortwoscomplement multi-pliers and the correction constant for unformed bits in truncated multipliersare not included in the reduction tree but are added during the nal additionstage. A 1 for rounding the lter output is also added in this stage. All oftheseconstantsforeachmultiplierareprecomputedandaddedasasingleconstant,CTOTAL.All multipliers used in this chapter operate on twos complement oper-ands. From Eq. (3), the constant that must be added for an m n multiplieris2m+n1+2n1+2m1. WithTtaps, thereareTmultiplyoperations(assumingTisevenlydivisiblebyM),soavalueofCM T2mn1 2n1 2m1 9mustbeaddedinthenaladditionstage.Themultipliersmaybetruncatedwithunformedcolumnsof partialproduct bits. If there are unformed bits, the total average reduction error ofthelterisT Er_avg.ThecorrectionforthisisCR round T r 1 22 T 2r2 2r10To round the lter output to l bits, the rounding constant that must be used isCRND 2rk111Combining these constants, the total correction constant for the lter isCTOTAL CM CR CRND12Adding CTOTAL to the multi-operand adder output is done using a spe-cializedcarry-saveadder(SCSA), whichissimplyacarry-saveadderopti-mizedforaddingaconstantbitvector.Acarry-saveadderusesfulladdersto reduce three bit vectors to two. 
SCSAs differ in that half adders are used in columns where the constant is a '0' and specialized half adders are used in columns where the constant is a '1'. A specialized half adder computes the sum and carry-out of two bits plus a 1, the logic equations being

s_i = ¬(a_i ⊕ b_i)  and  c_{i+1} = a_i ∨ b_i    (13)

The output of the SCSA is then input to the final carry propagate adder.

4. Final Carry Propagate Adder

The output of the specialized carry-save adder is the filter output in carry-save form. A final CPA is required to compute the final result. The final addition stage has ⌈T/M⌉ clock cycles to complete, so for many applications a simple ripple-carry adder will be fast enough. If additional performance is required, a carry-lookahead adder may be used. Using a faster CPA does not increase throughput, but does improve latency.

5. Control

A filter with T taps and M multipliers requires ⌈T/M⌉ clock cycles to process each input sample. The control circuit is a state machine with ⌈T/M⌉ states, implemented using a modulo-⌈T/M⌉ counter. The present state is the output of the counter and is used to control which operands are selected by each operand bus. In addition to the present state, the control circuit generates four other signals: (1) shiftData, which shifts the input samples; (2) clearAccum, which clears the accumulator; (3) loadCpaReg, which loads the output of the multi-operand adder into the CPA operand register; and (4) loadOutput, which loads the final sum into the output register.

III. FILTER GENERATION SOFTWARE

The architecture described in Section II provides a great deal of flexibility in terms of operand size, the number of taps, and the type of multipliers used. This implies that the design space is quite large. In order to facilitate the development of a large number of specific implementations, a tool was designed that automatically generates synthesizable structural VHDL models given a set of parameters.
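A parameter-driven generator of this kind can be sketched in a few lines. The entity and port layout below is a simplified, hypothetical illustration, not the actual output format of FGS.

```python
# Minimal sketch of parameter-driven VHDL generation in the spirit of the
# tool described here. The entity/port layout is illustrative only.

def gen_fir_entity(name, data_width, coeff_width):
    return "\n".join([
        f"entity {name} is",
        "  port (",
        "    clk     : in  std_logic;",
        "    loadtap : in  std_logic;",
        f"    x_in    : in  std_logic_vector({data_width - 1} downto 0);",
        f"    coeff   : in  std_logic_vector({coeff_width - 1} downto 0);",
        f"    y_out   : out std_logic_vector({data_width - 1} downto 0)",
        "  );",
        f"end {name};",
    ])

print(gen_fir_entity("fir16", 16, 16))
```

The point of such a generator is that sweeping a parameter (operand size, tap count, unformed columns) regenerates a consistent, synthesizable model instead of requiring hand edits.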
The tool, which is named filter generation software (FGS), also generates test benches and files of test vectors to verify the filter models.

FGS is written in Java and consists of two main packages. The arithmetic package, discussed in subsection A, is suitable for general use and is the foundation of FGS. The fgs package, discussed in subsection B, is specifically for generating the filters described previously. It uses the arithmetic package to generate the necessary components.

A. The arithmetic Package

The arithmetic package includes classes for modeling and simulating digital components. The simplest components include D flip-flops, half adders, and full adders. Larger components such as ripple-carry adders and parallel multipliers use the smaller components as building blocks. These components, in turn, are used to model complex systems such as FIR filters.

1. Common Classes and Interfaces

Figure 4 shows the classes and interfaces used by arithmetic subpackages. The most significant are VHDLGenerator, Parameterized, and Simulator.

VHDLGenerator is an abstract class. Any class that represents a digital component and can generate a VHDL model of itself is derived from this class. It defines three abstract methods that must be implemented by all subclasses. genCompleteVHDL() generates a complete VHDL file describing the component. This file includes synthesizable entity-architecture descriptions of all subcomponents used. genComponentDeclaration() generates the component declaration that must be included in the entity-architecture descriptions of other components that use this component. genEntityArchitecture() generates the entity-architecture description of this component.

Figure 4  The arithmetic package.

Parameterized is an interface implemented by classes whose instances can be defined by a set of parameters. The interface includes get and set methods to access those parameters.
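The abstract-class design described above can be mirrored in a short sketch. The original is Java; this Python analogue only illustrates the structure (method names follow the text, the HalfAdder body and its output strings are invented for the example).

```python
# Illustrative Python analogue of the FGS class design: an abstract
# generator base class whose concrete subclasses must supply all three
# VHDL-emitting methods described in the text.
from abc import ABC, abstractmethod

class VHDLGenerator(ABC):
    @abstractmethod
    def gen_complete_vhdl(self):            # genCompleteVHDL()
        ...
    @abstractmethod
    def gen_component_declaration(self):    # genComponentDeclaration()
        ...
    @abstractmethod
    def gen_entity_architecture(self):      # genEntityArchitecture()
        ...

class HalfAdder(VHDLGenerator):
    """A trivial concrete component (strings are placeholders)."""
    def gen_complete_vhdl(self):
        return self.gen_component_declaration() + "\n" + self.gen_entity_architecture()
    def gen_component_declaration(self):
        return "component half_adder ... end component;"
    def gen_entity_architecture(self):
        return "entity half_adder is ... end half_adder;"

print(HalfAdder().gen_complete_vhdl())
```

The payoff of the hierarchy is exactly what the text claims: a larger component can ask any subcomponent for its declaration and entity-architecture text without knowing its concrete type.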
Specific instances of Parameterized components can be easily modified by changing these parameters.

Simulator is an interface implemented by classes that can simulate their operation. The interface has only one method, simulate, which accepts a vector of inputs and returns a vector of outputs. These inputs and outputs are vectors of IEEE VHDL std_logic_vectors [14].

2. The arithmetic.smallcomponents Package

The arithmetic.smallcomponents package provides components such as D flip-flops and full adders that are used as building blocks for larger components such as registers, adders, and multipliers. Each class in this package is derived from VHDLGenerator, enabling each to generate VHDL for use in larger components.

3. The arithmetic.adders Package

The classes in this package model various types of adders including carry-propagate adders, specialized carry-save adders, and multi-operand adders. All components in these classes handle operands of arbitrary length and weight. This flexibility makes automatic VHDL generation more complex than it would be if operands were constrained to be the same length and weight. However, this flexibility is often required when an adder is used with another component such as a multiplier.

Figure 5 shows the arithmetic.adders package, which is typical of many of the arithmetic subpackages. CarryPropagateAdder is an abstract class from which carry-propagate adders such as ripple-carry adders and carry-lookahead adders are derived. CarryPropagateAdder is a subclass of VHDLGenerator and implements the Simulator and Parameterized interfaces. Using interfaces and an inheritance hierarchy such as this helps make FGS both straightforward to use and easy to extend. For example, a new type of carry-propagate adder could be incorporated into existing complex models by subclassing CarryPropagateAdder.

4. The arithmetic.matrixreduction Package

This package provides classes that perform matrix reduction, typically used by multi-operand adders and parallel multipliers. These classes perform Wallace, Dadda, and reduced area reduction [6–8]. Each of these classes is derived from the ReductionTree class.
5. The arithmetic.multipliers Package

A ParallelMultiplier class was implemented for this chapter and is representative of how FGS functions. Parameters can be set to configure the multiplier for unsigned, two's complement, or combined operation. The number of unformed columns, if any, and the type of reduction (Wallace, Dadda, or reduced area) may also be specified. A BitMatrix object, which models the partial product matrix, is then instantiated and passed to a ReductionTree object for reduction. Through polymorphism (dynamic binding), the appropriate subclass of ReductionTree reduces the BitMatrix to two rows. These two rows can then be passed to a CarryPropagateAdder object for final addition, or in the case of the FIR filter architecture described in this chapter, to a multi-operand adder.

The architecture of FGS makes it easy to change the bit matrix, reduction scheme, and final addition methods. New techniques can be added seamlessly by subclassing appropriate abstract classes.

Figure 5  The arithmetic.adders package.

6. The arithmetic.misccomponents Package

This package includes classes that provide essential functionality but do not logically belong in other packages. This includes Bus, which models the operand busses of the FIR filter, and Register, which models various types
All operands inFIRFilter objects are consideredtobe twoscomplementintegers, andthemultipliersandthemulti-operandadderusereduced area reduction. There are many parameters that can be set includingthe tapcoecient anddatalengths, the number of taps, the number ofmultipliers,andthenumberofunformedcolumnsinthemultipliers.8. Thearithmetic.testingPackageThis packageprovides classes for testingcomponents generatedbyotherclasses, including parallel multipliers and FIR lters. The FIR lter test classgenerates a test bench and an input le of test vectors. It also generates a.vecleforsimulationusingAlteraMax+PlusII.9. Thearithmetic.gui PackageThis package provides graphical user interface (GUI) components for settingparametersandgeneratingVHDLmodelsforallofthelargercomponentssuchasFIRFilter,ParallelMultiplier,etc.TheGUIforeachcomponentisaJava Swing JPanel, which can be used in any swing application. These panelsmake setting component parameters and generating VHDL les simple andconvenient.B. ThefgsPackageWhereasthearithmeticpackageissuitableforgeneraluse,thefgspackageisspecictotheFIRlterarchitecturedescribedinSectionII.fgsincludesclasses for automating much of the work done to analyze the use oftruncatedmultipliersinFIRlters. Forexample, thispackageincludesadriverclassthat automaticallygeneratesalargenumberof dierent FIRltercongurationsforsynthesisandtesting.CompleteVHDLmodelsarethengenerated, aswell asTcl scriptstodrivethesynthesistool. TheTclCopyright2004 by Marcel Dekker, Inc. All Rights Reserved.scriptcommandsthesynthesisprogramtowriteareaanddelayreportstodisk les, which are parsed by another class in the fgs package thatsummarizes the data and writes it to a CSV le for analysis by a spreadsheetapplication.IV. 
RESULTSTable1presents somerepresentativesynthesis results that wereobtainedfromtheLeonardosynthesis tool andtheLCA300K0.6micronCMOSTable1 SynthesisResultsforFilterswith16-bitOperands,OutputRoundedto16-bits(OptimizedforArea)Synthesisresults Improvement(%)T M rArea(gates)Totaldelay(ns)ADproduct(gatesns) AreaTotaldelayADproduct12 2 0 16241 40.80 662633 12 2 12 12437 40.68 505937 23.4 0.3 23.612 2 16 10211 40.08 409257 37.1 1.8 38.216 2 0 17369 54.40 944874 16 2 12 13529 54.24 733813 22.1 0.3 22.316 2 16 11303 53.44 604032 34.9 1.8 36.120 2 0 19278 68.00 1310904 20 2 12 15475 67.80 1049205 19.7 0.3 20.020 2 16 13249 66.80 885033 31.3 1.8 32.524 2 0 20828 81.60 1699565 24 2 12 17007 81.36 1383690 18.3 0.3 18.624 2 16 14781 80.16 1184845 29.0 1.8 30.312 4 0 25355 20.40 517242 12 4 12 18671 20.34 379768 26.4 0.3 26.612 4 16 14521 20.04 291001 42.7 1.8 43.716 4 0 26133 27.20 710818 16 4 12 19413 27.12 526481 25.7 0.3 25.916 4 16 15264 26.72 407854 41.6 1.8 42.620 4 0 28468 34.00 967912 20 4 12 21786 33.90 738545 23.5 0.3 23.720 4 16 17636 33.40 589042 38.0 1.8 39.124 4 0 29802 40.80 1215922 24 4 12 23101 40.68 939749 22.5 0.3 22.724 4 16 18950 40.08 759516 36.4 1.8 37.5FilterCopyright2004 by Marcel Dekker, Inc. All Rights Reserved.standardcelllibrary.Improvementsinarea,delay,andarea-delayproductfor lters using truncated multipliers are given relative to comparable ltersusingstandardmultipliers.Table2presentsreductionerrorguresfor16-bit lters withTtaps andr unformedcolumns. Additional datacanbefound in [15], which also provides a more detailed analysis of the FIR lterarchitecture presentedinthis chapter, including reductionandroundoerror.Themainndingswere:1. Using truncatedmultipliers inFIRlters results insignicantimprovements in area. For example, the area of a 16-bit lter with4multipliersand24tapsimprovesby22.5%with12unformedcolumnsandby36.4%with16unformedcolumns. Weestimatesubstantial powersavingswouldberealizedaswell. 
Truncation has little impact on the overall delay of the filter.

2. The computational error introduced by truncation is tolerable for many applications. For example, the reduction error SNR for a 16-bit filter with 24 taps is 86.7 dB with 12 unformed columns and 61.2 dB with 16 unformed columns. In comparison, the roundoff error for an equivalent filter without truncation is 89.1 dB [15].

3. The average reduction error of a filter is independent of r (for T > 4), and much less than that of a single truncated multiplier. For a 16-bit filter with 24 taps and r = 12, the average reduction error is only 9.18 × 10⁻⁵ ulps, where an ulp is a unit of least precision in the 16-bit product. In comparison, the average reduction error of a single 16-bit multiplier with r = 12 is 1.56 × 10⁻² ulps, and the average roundoff error of the same multiplier without truncation is 7.63 × 10⁻⁶ ulps.

Table 2  Reduction Error for Filters with 16-bit Operands, Output Rounded to 16 bits

  Filter         Reduction error
  T    r     SNR_R (dB)   σ_R (ulps)   E_AVG (ulps)
  12   0        ∞           0            0
  12  12      89.70         0.268        4.57E-5
  12  16      64.22         5.040        4.57E-5
  16   0        ∞           0            0
  16  12      88.45         0.310        6.10E-5
  16  16      62.97         5.820        6.10E-5
  20   0        ∞           0            0
  20  12      87.48         0.346        7.60E-5
  20  16      62.00         6.508        7.60E-5
  24   0        ∞           0            0
  24  12      86.69         0.379        9.18E-5
  24  16      61.21         7.143        9.18E-5

V. CONCLUSIONS

This chapter presented a tool used to rapidly prototype parameterized FIR filters. The tool was used to study the effects of using truncated multipliers in those filters. It was based on a package of arithmetic classes that are used as components in hierarchical designs, and are capable of generating structural level VHDL models of themselves. Using these classes as building blocks, FIRFilter objects generate complete VHDL models of specific FIR filters. The arithmetic package is extendable and suitable for use in other applications, enabling rapid prototyping of other computational systems. As a part of ongoing research at Lehigh University, the tool is being expanded to study other DSP applications, and will be made available to the public in the near future.*

REFERENCES

1. Lightbody, G., Walke, R., Woods, R. F., McCanny, J. V. (1998).
Rapid System Prototyping of a Single Chip Adaptive Beamformer. In: Proceedings of Signal Processing Systems, 285–294.
2. McCanny, J., Ridge, D., Yi, H., Hunter, J. (1997). Hierarchical VHDL Libraries for DSP ASIC Design. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 675–678.
3. Pihl, J., Aas, E. J. (1996). A Multiplier and Squarer Generator for High Performance DSP Applications. In: Proceedings of the 39th Midwest Symposium on Circuits and Systems, 109–112.
4. Richards, M. A., Gadient, A. J., Frank, G. A. (1997). Rapid Prototyping of Application Specific Signal Processors. Kluwer Academic Publishers.
5. Saultz, J. E. (1997). Rapid Prototyping of Application-Specific Signal Processors (RASSP) In-Progress Report. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 29–47.
6. Wallace, C. S. (1964). A Suggestion for a Fast Multiplier. IEEE Transactions on Electronic Computers, EC-13:14–17.

*As of this writing, the software has been refactored and given the name Magellan. The current state of Magellan can be found at http://www.lehigh.edu/~gew5/magellan.html, or by contacting the authors.

7. Dadda, L. (1965). Some Schemes for Parallel Multipliers. Alta Frequenza, 34:349–356.
8. Bickerstaff, K. C., Schulte, M. J., Swartzlander, E. E. (1995). Parallel Reduced Area Multipliers. IEEE Journal of VLSI Signal Processing, 9:181–191.
9. Schulte, M. J., Swartzlander, E. E. (1993). Truncated Multiplication with Correction Constant. In: VLSI Signal Processing VI. Eindhoven, Netherlands: IEEE Press, pp. 338–396.
10. Schulte, M. J., Stine, J. E., Jansen, J. G. (1999). Reduced Power Dissipation Through Truncated Multiplication. In: IEEE Alessandro Volta Memorial Workshop on Low Power Design, Como, Italy, 61–69.
11. Swartzlander, E. E. (1999). Truncated Multiplication with Approximate Rounding. In: Proceedings of the 33rd Asilomar Conference on Signals, Circuits, and Systems, 1480–1483.
12. Oppenheim, A. V., Schafer, R. W. (1999). Discrete-Time Signal Processing. 2nd ed. Upper Saddle River, NJ: Prentice Hall.
13.
Koren, I. (1993). Computer Arithmetic and Algorithms. Englewood Cliffs, NJ: Prentice Hall.
14. IEEE Standard Multivalue Logic System for VHDL Model Interoperability (Std_logic_1164): IEEE Std 1164-1993 (26 May 1993).
15. Walters, E. G. III (2002). Design Tradeoffs Using Truncated Multipliers in FIR Filter Implementations. Master's thesis, Lehigh University.

2
An LUT-Based High Level Synthesis Framework for Reconfigurable Architectures

Loïc Lagadec, Bernard Pottier, and Oscar Villellas-Guillen
Université de Bretagne Occidentale, UFR Sciences, Brest, France

I. INTRODUCTION

A. General Context

It is a fact that integration technology is providing hardware resources at an exponential rate while the development methods in industry are only progressing at a linear rate. This can be seen as the repetition of a common situation where a mature technical knowledge is providing useful possibilities in excess of current method capabilities. An answer to this situation is to change the development process in order to avoid work repetition and to provide more productivity by secure assembly of standard components.

This situation is also known from computer scientists, since it was encountered earlier in the programming language story [1]. The beginnings of this story were: (1) symbolic expression of computations (Fortran), (2) structured programs (Algol), (3) modularity and code abstraction via interfaces and hiding (Modula2, object-oriented programming).

Modularity came at an age when efficient engineering of large programs was the main concern, and when the task of programming was overtaken by organizational problems. System development can be considered as a new age for computer architecture design, with hardware description languages needing to be transcended by a higher level of description to increase productivity. Companies developing applications have specific methods for design and production management in which they can represent their products,
tools, and hardware or software components. The method that ensures the feasibility of a product leads technical choices and developments. It also changes some of the rules in design organization, since most applications are achieved in a top-down fashion using object models or code generators to reach functional requirements.

B. Reconfigurable Architectures

FPGAs are one of the driving forces for integration technology progress, due to their increasing applications. Like software and hardware programming languages, reconfigurable architectures are sensitive to scale mutations. As the chip size increases, the characteristics of the application architecture change, with new needs for structured communications, more efficiency on arithmetic operators, and partial reconfigurability.

The software follows slowly, migrating from HDL (Hardware Description Language) to HLL (High Level Language). Preserving the developments and providing a sane support for production tools is a major issue. Reconfigurable architectures can take benefits from the top-down design style of subsection A, by a high level of specialization in applications.

C. Madeo

MADEO is a medium term project that makes use of open object modeling to provide a portable access to hardware resources and tools on reconfigurable architectures. The project structure has three parts that interact closely (bottom-up):

1. Reconfigurable architecture model and its associated generic tools. The representation of practical architectures on a generic model enables sharing of basic tools such as place and route, allocation, and circuit edition [2]. Mapping a logic description to a particular technology is achieved using generic algorithms from SIS [3], or PPart [4]. Specific atomic resources, such as memories, sensors or operators, can be merged with logic, and the framework is extensible.

2. High-level logic compiler.
This compiler produces circuits associated to high-level functionalities on a characterization of the above model. Object-oriented programming is not restricted to a particular set of operators or types, and then provides the capability to produce primitives for arbitrary arithmetics or symbolic computing.

The compiler handles an intermediate level, which is a graph of lookup tables carrying high-level values (objects). Then this graph is translated into a logic graph that will be mapped on
Anillustrationisgivenwithanexampleofacoder/decoderfamilyforRAID(RedundantArrayofInexpensiveDisks)systemswithquantitativeresultsinSectionV.II. AFRAMEWORKFORLOGICSYNTHESISA. ArchitectureModelingRecongurable architectures can mix dierent grains of hardware resources:logicElements,operators,communicationlines,buses,switches,memories,processors, and so on.MostFPGAs(FieldProgrammableGateArrays)providelogicfunc-tionsusingsmalllookupmemories(LUT)addressedbyasetofsignals.Asseen fromthe logic synthesis tools, an n-bit wide LUT is the most general waytoproduce anylogic functionof nbooleanvariables. There are knownalgorithms and tools for partitioning large logic tables or networks to targeta particular LUT-based architecture.LUTsare eectively interconnected duringthe conguration phase toformlogic. This is achievedusingvarious congurable devices suchasprogrammable interconnect points, switches, or shared lines. Some commer-Copyright2004 by Marcel Dekker, Inc. All Rights Reserved.cial architectures alsogroupseveral LUTs andregisters intocells calledcongurable logic blocks (CLB).Our model for the organization of these architectures is a hierarchy ofgeometric patterns of hardware resources. The model is addressedviaaspecic grammar [2] allowing the description of concrete architectures. Giventhisdescription,generictoolsoperatefortechnologymapping,andplacingand routing logicmodules. See Figure9 shows a viewof thegenericeditor.Circuits such as operators or computing networks are described by programsrealizing the geometric assembly of such modules and their connections.Usingthisframework,fewdaysofworkaresucienttobringuptheset of toolsonanewarchitecturewiththepossibilitytoport applicationcomponents. Onaconcreteplatform, itisthennecessarytobuildthebit-streamgenerationsoftwarebyrewritingtheapplicationdescriptionstothenativetools. 
Two practical examples are the xc6200, which has a public architecture and has been addressed directly, and the Virtex 1, which is addressed through the JBits API; other implementations include industrial prototype architectures.

If behavioral generators are known to offer numerous benefits over HDL synthesis, including ease of specifying a specialized design and the ability to perform partial evaluation [9], they generally remain dependent on some libraries of modules appearing as primitives [10,11]. Our approach draws attention to itself by relying on a generic back-end tool in charge of the module production. There are no commercial tools or libraries involved in the flow.

B. Programming Considerations

Applications for fine-grain reconfigurable architectures can be specialized without compromise, and they should be optimized in terms of space and performance. In our view, too much emphasis is placed on the local performance of standard arithmetic units in the synthesis tools and also in the specification languages.

A first consequence of this is the restricted range of basic types coming from the capabilities of ALU/FPUs or memory address mechanisms. Control structures strictly oriented toward sequentiality are another aspect that can be criticized. As an example, programming for multimedia processor accelerators remains procedural in spite of all the experiences available from the domain of data parallel languages. Hardware description languages have rich descriptive capabilities; however, the necessity to use libraries has led the language designers to restrict their primitives to a level similar to C.

Our aim was to produce a more flexible specification level with direct and efficient coupling to logic.
This implies allowing easy creation of specific arithmetics representing the algorithm needs, letting the compilers automatically tune data width, and modeling computations based on well-understood object classes.

The expected effect was an easy production of dedicated support for processes that need a high level of availability, or would waste processor resources in an integrated system. To reach this goal, we used specifications with symbolic and functional characteristics, with separate definition of data on which the program is to operate.

Sequential computations can be structured in various ways by splitting programs on register transfers, either explicitly in the case of an architecture description, or implicitly during the compilation. Figure 1 shows these two aspects, with a circuit module assembled in a pipeline and in a data-path. In the case of simple control loops or state machines, high-level variables can be used to retain the initial state with known values, with the compiler retrieving progressively the other states by enumeration [7]. Figure 2 shows a diagram where registers are provided to hold state values associated to high-level variables that could be instance variables in an object.

Figure 1  The modules can be either flat or hierarchical. The modules can be composed in order to produce pipelines or can be instantiated during architecture synthesis.

At this stage, we will consider the case of methods without side-effects operating on a set of objects. For the sake of simplicity we will rename these
Thus the notion of programgroups thealgorithm and the data description at once. Our programcan be embedded inhigherlevelcomputationsofvariouskind,implyingvariablesormemories.Data descriptions are inferred fromthese levels. The resulting circuit is highlydependent from the data it is intended to process.An execution is the traversal of a hierarchical network of lookup tablesin which values are forwarded. Avalue change in the input of a table implies apossible change in its output that in turn induces other changes downstream.Thesenetworksreecttheeectivefunctionstructureattheprocedurecallgrain and they have a strong algorithmic meaning. Among the dierent pos-sibilitiesoeredforpractical execution, therearecascadedhashtableac-Figure2 Statemachinescanbeobtainedbymethodsoperatingonprivatevari-ableshavingknowninitialvalues.Copyright2004 by Marcel Dekker, Inc. All Rights Reserved.cessesanduseofgeneral purposearithmeticunitswheretheyaredetectedto t.Translation to FPGAs need binary representation for objects, as shownin Fig. 6. This is achieved in two ways, by using a specic encoding known tobe ecient, or by exchanging object values appearing in the input and outputfor indexes in the enumeration of values. Figure 3 shows a fan-in case with anaggregationofindexesintheinputoffunctionh(). Basicallythelow-levelrepresentationofanodesuchash()isaprogrammablelogicarray(PLA),having in its input the Cartesian product of the set of incoming indexes ( fout gout), and in its output the set of indexes for downstream.Some important results or observations from this exchange are:1. Datapathsinsidethenetworkdonot dependanymoreondatawidth but on the number of dierent values present on the edges.2. Dependingontheinterfacingrequirements, it will beneededtoinsert nodes intheinputandoutputof thenetworkto handletheexchanges between values and indexes.3. Logic synthesis tool capabilities are limited to medium-grainproblems. 
ToallowcompilationtoFPGAs, algorithmsmustde-crease the number of values down to nodes that can be easily handledby the bottomlayer (SIS partitioning for LUT-n). Today, this grainis similar to algorithms coded for 8-bit microprocessors.4. Decreasing the number of values is the natural way in whichfunctions operate, since the size of a Cartesian product on a func-tioninputvaluesisthemaximumnumberofvaluesproducedinthe output. The number of values carried by edges decreases eitherFigure3 Fan-infrom2nodeswithCard( fout gout)>>>>>>>>>>:9>>>>>>=>>>>>>;Computationsofpiecewiseregularalgorithmsmayberepresentedbya dependence graph (DG). The dependence graph of the algorithmofExample III.1is showninFig. 1a. The dependence graphexpresses thepartial order betweenthe operations. Eachvariable of the algorithmisrepresentedateveryindexpointI a Ibyonenode.Theedgescorrespondto the data dependencies of the algorithm. They are regular throughout thealgorithm(i.e., a[i,j,k] isdirectlydependentona[i,j1,k]). Thedependencegraph species implicitly all legal execution orderings of operations: if thereisadirectedpathinthedependencegraphfromonenodea[J] toanodez[K] where J, KaI, then the computation of a[J] must precede thecomputationofz[K].Henceforth, and without loss of generality,* we assume that all indexedvariables are embedded in a common index space I. Then, the correspondingdependence graphs can be representedin a reduced form.DenitionIII.3 (Reduceddependencegraph). A reduced dependencegraph (RDG) G = (V, E, D) of dimension n is a network where V is a set ofnodes and E p V V is a set of edges. To each edge e = (vi, vj) there is as-sociated a dependence vector dija D oZn.TheRDGofthematrixmultiplicationalgorithmisshowninFig.1a.Eachnode v in the graph corresponds to one equation in the section computationsofthealgorithm.*Allmethodsdescribedcanalsobeappliedtoeachquanticationindividually.Copyright2004 by Marcel Dekker, Inc. All Rights Reserved.B. 
Space-Time Mapping

Linear transformations, as in Eq. (1), are used as space-time mappings [16,21] in order to assign a processor index p ∈ Z^(n−1) (space) and a sequencing index t ∈ Z (time) to index vectors I ∈ I:

    ( p )             ( Q )
    (   ) = T · I  =  (   ) · I        (1)
    ( t )             ( λ )

In Eq. (1), Q ∈ Z^((n−1)×n) and λ ∈ Z^(1×n). The main reason for using linear allocation and scheduling functions is that the data flow between PEs is local and regular, which is essential for low-power VLSI implementations. The interpretation of such a linear transformation is as follows: the set of operations defined at index points λ·I = const are scheduled at the same time step.

Figure 1  In (a), an index space and the reduced dependence graph are shown. Some possible mappings are depicted in (b).

The index space of allocated processing elements (processor space) is denoted by Q and is given by the set Q = { p | p = Q·I ∧ I ∈ I }. This set can also be obtained by choosing a projection of the dependence graph along a vector u ∈ Z^n; i.e., any coprime* vector u satisfying Q·u = 0 [16] describes the allocation equivalently.

Allocation and scheduling must ensure that no data dependencies in the DG are violated. This is guaranteed by the following causality constraint: λ·d_ij ≥ 0, ∀ (v_i, v_j) ∈ E. A sufficient condition for guaranteeing that no two or more index points are assigned to a processing element at the same time step is given by

    rank(T) = rank(Q; λ) = n        (2)

Using the projection vector u satisfying Q·u = 0, this condition is equivalent to λ·u ≠ 0 [28].

Definition III.4 (Iteration interval) [30]. The iteration interval π of an allocated and scheduled piecewise regular algorithm is the number of time instances between the evaluations of two successive instances of a variable within one processing element.

Definition III.5 (Block pipelining period) [17]. The block pipelining period of an allocated and scheduled piecewise regular algorithm is the time interval between the initiations of two successive problem instances and is denoted by β.

Let us consider the matrix multiplication algorithm introduced in Example III.1 as a problem instance. The whole matrices A and B have to be read into the processor array before the next pair can be read; the time between these input operations is the block pipelining period β.
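The space-time mapping of the matrix-multiplication example can be sketched numerically. The following pure-Python fragment is a minimal illustration, assuming the 3-D index space with the bounds N1 = 4, N2 = 5, N3 = 2 and the mapping u = (1 0 0)^T, λ = (1 0 1) used later in Section V; the dependence vectors for b and z are assumed from the usual systolic formulation (the text only states the one for a explicitly).

```python
from itertools import product

# Index space I of the matrix-multiplication example (bounds from Section V):
# i in 1..N1, j in 1..N2, k in 1..N3
N1, N2, N3 = 4, 5, 2
index_space = [I for I in product(range(1, N1 + 1),
                                  range(1, N2 + 1),
                                  range(1, N3 + 1))]

# Space-time mapping T = (Q; lambda): projection along u = (1 0 0)^T,
# schedule vector lambda = (1 0 1).
Q = [(0, 1, 0), (0, 0, 1)]      # processor index: p = Q . I
lam = (1, 0, 1)                 # time step:       t = lambda . I

dot = lambda a, b: sum(x * y for x, y in zip(a, b))

# Causality constraint lambda . d >= 0 for all dependence vectors;
# d_a = (0,1,0) is stated in the text, d_b and d_z are assumed.
deps = [(0, 1, 0), (1, 0, 0), (0, 0, 1)]
assert all(dot(lam, d) >= 0 for d in deps)

# Processor space (set of allocated PEs) and block pipelining period beta
processors = {tuple(dot(q, I) for q in Q) for I in index_space}
times = [dot(lam, I) for I in index_space]
beta = max(times) - min(times)

print(len(processors), beta)   # -> 10 4   (#PE = N2*N3, beta = N1)
```

The computed values reproduce the figures quoted in Section V: #PE = N2·N3 = 10 processing elements and a block pipelining period of β = N1 = 4.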
Let λ be the schedule vector. Then, the block pipelining period β may be computed as follows:

    β = max{λ·I1 | I1 ∈ I} − min{λ·I2 | I2 ∈ I} = max{λ·(I1 − I2) | I1, I2 ∈ I}

* A vector x is said to be coprime if the absolute value of the greatest common divisor of its elements is one.

IV. POWER MODELING AND ENERGY ESTIMATION

In digital CMOS circuits, the dominant source of power consumption is switching power [26]. The average power consumed by a CMOS gate can be computed using the following equation:

    P_sw = (1/2) · C_L · V_dd² · N · f

where C_L is the gate output load capacitance, V_dd is the supply voltage, f is the clock frequency, and N is the average or expected number of output transitions per clock cycle.

Due to the influence of the switching activity on the power consumption, our main idea is to exploit the fact that power consumption is drastically reduced when some inputs of a functional unit remain unchanged for n > 1 clock cycles.

Here, we want to discuss the impact of the space-time mapping on the power and energy consumption, respectively, of the resulting processor array. Our approach identifies regions with decreased switching activity of the functional units' input operands and takes these power savings into account. An estimation methodology is presented in the following. This methodology estimates, for a given piecewise regular algorithm and a space-time mapping T, the average power consumption of the entire array. Briefly described, this methodology can be subdivided into two hierarchical estimation steps: PE-level power estimation and array-level power estimation.

A. PE-Level Power Estimation

A diagram of the internal structure of a typical processor element is shown in Fig. 2. It consists of a core where all the functional units are located, a controller, and some delay registers. In Section V, we quantify typical percentages of power consumption for the functional units P_FU, the control structures P_Ctrl, and the registers P_Rg, and these parts' proportion of the overall power consumption of one processing element. The power consumption of one PE can be approximated as follows:

    P_PE(λ, u) = P_FU(u) + P_Ctrl(λ) + P_Rg(λ)

Figure 2  Schematic internal structure of one processor element.

For characterization of the functional units (adders, multipliers, etc.), standard register-transfer level power estimation tools from Synopsys [27] are used. In Table 1, the average power consumption of some 16-bit functional units is listed (A = ripple-carry adder, B = carry-save array multiplier, C = carry-save array multiplier with two pipeline stages, and D = Wallace-tree multiplier with three pipeline stages). Each functional unit has two input operands. The value of one operand is assumed to be constant for n clock cycles; the other can change randomly in every clock cycle. These values are shown in Fig. 3a for the 16-bit ripple-carry adder and in Fig. 3b for the multipliers, respectively. The curves are derived by regression, where the function is of type P = a0 + a1·e^(−n) + a2·n·e^(−n) + a3·n²·e^(−n). The regression is good enough to have less than 2% error. Since we are only interested in integer multiples of the clock cycle for n, the derived models may be stored in a table without too much effort.

It can be seen from these figures that the power consumption of a functional unit depends heavily on the number of cycles one input operand stays constant.
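The regression model and its table-driven use can be sketched as follows. The coefficients in this fragment are illustrative placeholders, not the fitted values from the chapter (those are obtained per functional unit from the Synopsys characterization data); only the functional form P(n) = a0 + a1·e^(−n) + a2·n·e^(−n) + a3·n²·e^(−n) is taken from the text.

```python
import math

# Per-unit power model from the regression in Section IV.A:
#   P(n) = a0 + a1*e^(-n) + a2*n*e^(-n) + a3*n^2*e^(-n)
# where n is the number of clock cycles one operand stays constant.
# Coefficients below are hypothetical, chosen only to show the shape.
def unit_power(n, a0=14.5, a1=30.0, a2=10.0, a3=5.0):
    e = math.exp(-n)
    return a0 + a1 * e + a2 * n * e + a3 * n * n * e

# Table-driven use, as suggested in the text: since n only takes integer
# multiples of the clock cycle, the model can be tabulated once and the
# estimator then performs simple lookups (cf. lookUpPower below).
lookup_power = {n: unit_power(n) for n in range(1, 11)}

# With positive decay coefficients the modelled power falls monotonically
# toward the floor a0 as the operand stays constant longer -- exactly the
# effect the estimation methodology exploits.
assert all(lookup_power[n] > lookup_power[n + 1] for n in range(1, 10))
```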
A good estimation methodology, therefore, should exploit this observation for obtaining accurate estimations and for studying the influence of space-time mappings on the resulting power consumption.

Table 1  Average Power Consumption of Different Functional Units

n     P_avg,A     P_avg,B     P_avg,C     P_avg,D
1     26.97 μW    204.2 μW    212.0 μW    319.6 μW
2     22.33 μW    155.4 μW    164.0 μW    225.0 μW
3     18.82 μW    138.6 μW    145.6 μW    190.1 μW
4     16.99 μW    129.6 μW    137.3 μW    175.1 μW
5     16.31 μW    125.4 μW    133.8 μW    164.3 μW
6     15.68 μW    120.5 μW    128.4 μW    159.4 μW
7     15.48 μW    119.5 μW    125.2 μW    153.3 μW
8     15.29 μW    116.8 μW    124.4 μW    151.6 μW
9     15.09 μW    116.3 μW    123.7 μW    147.8 μW
10    14.89 μW    115.5 μW    122.7 μW    145.8 μW
∞      8.49 μW    –           –           –

Figure 3  Average power consumption of some 16-bit functional units when one operand is constant for n clock cycles and the other can change randomly in every clock cycle.

B. Array-Level Power Estimation

Based on the class of piecewise regular algorithms, we want to estimate the power consumption for a given space-time mapping T = (Q; λ). It is obvious that the cost (number of processor elements) and the latency are influenced by the space-time mapping. In earlier work [13], we described how to determine the cost and the latency as a measure of performance. Here, we briefly outline the main ideas. If we assume that processor arrays are resource-dominant, we are able to approximate the cost as being proportional to the processor count. Ehrhart polynomials [6,9] may be evaluated to count the number of points (processor elements, #PE) in the projected index space.

The latency is determined by solving a minimization problem, which may be formulated as a mixed-integer linear program (MILP) [29,30]. Also, modified low-power scheduling and binding techniques as in [23,25] can be applied to compute a suitable schedule.

Here, we discuss the impact of the space-time mapping on the power and energy consumption, respectively, of the resulting processor array.
Our approach identifies regions with decreased switching activity of the functional units' input operands and takes these power savings into account. An estimation algorithm is presented on the following pages. The algorithm estimates, for a given RDG G, an index space I, a space-time mapping T, the number of processor elements #PE, and the block pipelining period β, the average power consumption P_array of the entire array. The processor count #PE and the block pipelining period β of the array may be computed as described earlier in this chapter.

Once the average power consumption P_array of the entire processor array is estimated, the energy consumption (per problem instance) is computed as follows:

    E = β · P_array

POWER ESTIMATION
1   IN: RDG G, I, T = (Q; λ), #PE, and β
2   OUT: P_array
3   BEGIN
4     P_PE ← 0
5     FOR all nodes v ∈ G DO
6       P_v,1 ← lookUpPower(v, 1)
7       P_PE ← P_PE + P_v,1
8     ENDFOR
9     P_array ← #PE · P_PE
10    FOR all edges e ∈ G DO
11      d is dependence vector of edge e
12      node v ← source(e)
13      node w ← target(e)
14      IF (v = w) THEN
15        IF (S_v is propagation equation) THEN
16          IF (Q·d = 0) THEN
17            FOR all adjacent edges e′ of v
18              d′ is dependence vector of edge e′
19              IF (d′ = 0) THEN
20                w ← target(e′)
21                P_w,1 ← lookUpPower(w, 1)
22                P_w,β ← lookUpPower(w, β)
23                P_array ← P_array − #PE · (P_w,1 − P_w,β)
24              ENDIF
25            ENDFOR
26          ENDIF
27        ELSE
28          (k, m) ← getOperandFixedCycles(T, v)
29          P_w,1 ← lookUpPower(w, 1)
30          P_w,k ← lookUpPower(w, k)
31          P_array ← P_array − m · (P_w,1 − P_w,k)
32        ENDIF
33      ENDIF
34    ENDFOR
35  END

In our experiments, we assumed that the iteration interval is one and that each RDG node is mapped onto a dedicated resource (no resource sharing of functional units). Our estimation algorithm can be subdivided into two phases. In the first phase, the worst-case power consumption is computed (i.e., when the switching activity of all functional units' input operands is highest).
Therefore, the power consumption P_PE of one processor element is determined by summation of the power consumption P_vi,1 of all of its FUs:

    P_PE = Σ_{v_i ∈ V} P_vi,1

The one in the term P_vi,1 denotes that operands can change in every clock cycle. Subsequently, the power consumption of the entire array is obtained by extrapolation of this value.

In the second phase of the algorithm, array regions with lower switching activity are detected. Therefore, the whole reduced dependence graph is traversed to examine self-loops* (see line 14). These self-loops correspond to inputs of a processor element. If these inputs remain unchanged for more than one period, the switching activity is decreased and consequently also the power. It remains to be determined for how many clock cycles inputs are constant and how many processor elements are affected. Two cases can be differentiated.

1. Propagation equations mapped onto themselves. Propagation equations are used only to distribute data from one processor to another. Due to the regularity and locality of the considered processor arrays, they occur very commonly. If such a propagation equation is mapped onto itself (Q·d = 0, see line 16), no data transport is needed (i.e., the data remains in one processor element unchanged for β cycles until the next problem instance is fed into the array). Thus, the switching activity of all adjacent nodes v_i (functional units) in the same processor element is reduced. Therefore, the estimation value of the average power consumption is corrected (decreased) by P_vi,1 − P_vi,β. As a propagation equation has global influence, the activity is reduced in every processor element (#PE).

2. Other self-loops. These are the remaining inputs, which may be constant for k clock cycles. Let the number of processor elements with these constant inputs be denoted m. Let I_in1 be the input index space of variable in_1.

* A self-loop is an edge where source and target node are the same.
Transforming this index space by Q and counting the number of points in the transformed space gives m:

    m = |{ Ī ∈ Z^(n−1) | Ī = Q·I ∧ I ∈ I_in1 }|

This counting problem is similar to the problem described earlier, and its solution can be obtained by a geometrical approach [5]. The number of integral points can be determined by considering whether the given projection vector u (Q·u = 0) enters a facet of the index space I_in1, and how thick this facet must be until two points are projected onto each other. Algebraically, the thickness is derived from the value of the inner product of the normal vector of a facet and the projection vector. The union of thick facets can be a nonconvex polytope. The number of integral points inside this (non-)convex polytope is determined by the use of Ehrhart polynomials [6].

Once k and m are determined (see line 28, function getOperandFixedCycles), the overall estimated power consumption value can be improved by subtracting m · (P_in1,1 − P_in1,k).

In the next section the overall algorithm is explained and quantitative results are discussed.

V. EXPERIMENTS

Reconsider the introductory Example III.1. As an allocation, we chose a 16-bit ripple-carry adder for the addition and a three-stage pipelined Wallace-tree multiplier for the multiplication. The input operations a and b were each mapped to one resource of type input. The execution times of these operations are zero. This is equivalent to a multicast without delay to a set of processors. Furthermore, let u = (1 0 0)^T be the chosen projection vector. Then, after scheduling and cost calculus, we obtain the schedule vector λ = (1 0 1) and as cost #PE = N2·N3. Now, with this information we are able to estimate the power consumption by applying the proposed algorithm. First, the worst-case power consumption is determined (i.e., the switching activity of functional units when input operands change each cycle).
Second, in the main part of the algorithm, two types of equations with lower input activity are detected and the overall power consumption is refined.

The processor array for a projection in direction u = (1 0 0)^T is shown in Fig. 4. Due to this projection, the variable b is mapped onto itself. From this it follows that one operand of the multiplication remains unchanged for some time. At the beginning of a computation, the whole matrix B is input simultaneously to the array, whereas the matrix A is fed sequentially row by row from the left side into the array. Since the matrix A has N1 rows, one operand of the multiplier is fixed for β = N1 clock cycles, which significantly reduces the power consumption in the multipliers, by 45% (see Table 1). On account of the design regularity, the power savings can be multiplied by #PE (line 23 of the algorithm). The second point where less power is consumed is the constant input variable c. One input of the adders in the lower row of the processor array is permanently zero. These partial regions with reduced power consumption in the array are determined by the function getOperandFixedCycles. In addition to the time (k = ∞) for which one input remains unchanged, the number m = N2 of processors with reduced switching activity is returned.

Figure 4  Processor array for u = (1 0 0)^T, N1 = 4, N2 = 5, and N3 = 2.

In Table 2, the power consumption for different projection vectors is shown, where for illustration purposes, the upper boundaries of the index space are set to N1 = 4, N2 = 5, and N3 = 2. In the table, P_sim is the exact value obtained by simulation of the entire array. The worst-case extrapolation (lines 4–9 in the algorithm) is denoted by P_ext. The power consumption of our estimation algorithm is labeled P_est. Where the simple extrapolation method has errors up to 81%, our approach is very accurate with errors less than 5%.

Table 2  Average Power and Energy Consumption of Different Mappings

u          P_sim[μW]  P_ext[μW]  Err_ext[%]  P_est[μW]  Err_est[%]  E_sim[pJ]  E_est[pJ]
(1 0 0)^T    2020       3466       71.6        1928        4.6        80.8       77.1
(0 1 0)^T    1530       2773       81.2        1456        4.8        76.5       72.8
(0 0 1)^T    7260       6931        4.5        6931        4.5       145.2      138.6

Furthermore, the energy values per matrix multiplication in the table show the significant influence of the chosen space-time mapping. Different mappings can lead to energy consumptions that differ by up to a factor of two.

A.
Quantification of the Power Consumption Inside One Processor Element

In this subsection, we quantify the percentages of power consumption for the functional units P_FU, the control structures P_Ctrl, and the registers P_Rg. In Table 3, the proportions of these parts in the overall power consumption of one processing element are depicted for the three unit-vector mappings. For the matrix multiplication algorithm, the major part of the power consumption is caused by the functional units; this part is around 90%. The power consumption of the registers is only 4.1–6.6% of the total power consumption of one processor element. It should be recalled that the iteration interval is only one and no resource sharing is used, since at each index point only one multiplication and one addition have to be performed.

The second example is a piecewise regular algorithm for LU decomposition. In Fig. 5, a piecewise regular processor array for the LU decomposition is schematically shown. This array can be subdivided into three pieces, where Parts A and B also change their functionality over time. Since divisions are performed in Parts B and C, the percentage of the functional units in the overall power consumption is greater than for Part A. The percentages for the different parts of the LU decomposition array are listed in Table 3.

Figure 5  Sketch of the piecewise regular processor array for LU decomposition.

Table 3  Percentages of Power Consumption for Functional Units, Control Structures, and Registers

Algorithm                  u           P_FU[%]  P_Ctrl[%]  P_Rg[%]
Matrix multiplication      (1 0 0)^T     87.8      5.7       6.6
Matrix multiplication      (0 1 0)^T     86.6      7.1       6.3
Matrix multiplication      (0 0 1)^T     91.8      4.1       4.1
LU decomposition (A)       (1 0 0)^T     78.5      9.2      12.3
LU decomposition (B, C)    (1 0 0)^T     88.0      6.1       5.9

VI.
DETERMINATION OF ENERGY-OPTIMAL SPACE-TIME MAPPINGS

In the previous section we showed that different space-time mappings have a great influence on energy consumption. In this section we want to make use of this fact to determine energy-optimal space-time mappings.

The algorithm proposed in Section IV.B determines the power consumption for an arbitrary given space-time mapping T = (Q; λ), T ∈ Z^(n×n). Since T can equivalently be described by a schedule vector λ ∈ Z^(1×n) and a projection vector u ∈ Z^n, we have 2n parameters of possible space-time mappings. In [13], we presented efficient pruning techniques for the search of optimal space-time mappings (projection vectors). Here, we summarize the main ideas: (1) only consider coprime projection vectors u, and (2) only consider coprime vectors that have the property that at least two points in I are projected onto each other. This leads to a search space of coprime vectors in a convex polytope called the difference body of points in I. Finally, in this reduced search space, we can exploit symmetry to exclude search vectors v = −v′, such that typically only few projection vector candidates v have to be investigated. Ehrhart polynomials [6,9] may be evaluated to count the number of points in the projected index space.
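The pruning rules just summarized can be sketched in a few lines. The fragment below is a minimal illustration, assuming the N1 × N2 × N3 index space of the running example; it enumerates candidate projection vectors drawn from the difference body of I, keeps only coprime ones, and removes one of each symmetric pair u, −u.

```python
from itertools import product
from math import gcd
from functools import reduce

# Full rectangular index space of the running example (bounds assumed
# as in Section V); its difference body {I1 - I2 | I1, I2 in I} is the
# polytope from which candidate projection vectors are drawn.
N1, N2, N3 = 4, 5, 2
I = list(product(range(1, N1 + 1), range(1, N2 + 1), range(1, N3 + 1)))
diffs = {tuple(a - b for a, b in zip(p, q)) for p in I for q in I}

def coprime(u):
    # gcd of the absolute values of the components is one
    return reduce(gcd, (abs(x) for x in u)) == 1

candidates = set()
for u in diffs:
    if u == (0, 0, 0) or not coprime(u):
        continue                      # rule (1): coprime vectors only
    if tuple(-x for x in u) in candidates:
        continue                      # symmetry: keep one of u and -u
    candidates.add(u)

# Every remaining candidate projects at least two points of I onto each
# other (rule (2)), because it was taken from the difference body of I.
print(len(candidates))
```

For each surviving candidate one would then, as the text describes, determine #PE, the minimal-latency schedule, and the estimated power, and keep the mapping with the smallest energy β · P_array.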
Let |U| be the number of projection vector candidates; for each projection vector u ∈ U, the minimal latency was determined by solving a mixed-integer linear program (MILP) [13,29,30]. Then, we must also estimate the power consumption for |U| space-time mappings. Since the set U can still be very large, we propose an efficient heuristic methodology to find an energy-optimal space-time mapping in the following.

ENERGY OPTIMIZATION
1   IN: RDG G, I
2   OUT: E_opt, T_opt
3   BEGIN
4     U ← ∅
5     FOR all edges e ∈ G DO
6       d is dependence vector of edge e
7       IF (d ≠ 0 ∧ d ∉ U) THEN
8         use d as projection vector and construct Q from it
9         #PE ← determineNoOfPEs(d)
10        (λ, β) ← minimizeLatency(G, I, d)
11        P_array ← powerEstimation(G, I, T, #PE, β)
12        E ← β · P_array
13        IF (E <

1 => Input_ToPhase(1, 1, 10, Input_Portx_mirr2),
2 => Input_ToPhase(2, 1, 1, Input_Portx_mirr2),
3 => Input_ToPhase(3, 2, 6, Input_Portx_mirr1 or Input_Portx_mirr2),
4 => Input_ToPhase(5, 2, 1, Input_Portx_mirr1 or Input_Portx_mirr2),
5 => Input_ToPhase(7, 2, 2, Input_Portx_mirr1 or Input_Portx_mirr2),
6 => Input_ToPhase(9, 4, 78, Input_Portd_mirr1 or Input_Porty_mirr1 or Input_Portx_mirr1 or Input_Portx_mirr2),
7 => Input_ToPhase(13, 4, 1, Input_Portd_mirr1 or Input_Porty_mirr1 or Input_Portx_mirr1 or Input_Portx_mirr2),
8 => Input_ToPhase(17, 1, 1, Input_Portx_mirr2),
9 => Input_ToPhase(0, 1, 4, (others => 0)));

REFERENCES

1. Annapolis Micro Systems, Inc. WildStar Datasheet, 2001.
2. Catthoor, F., Danckaert, K., Kulkarni, C., Omnes, T. (2000). Data Transfer and Storage Architecture Issues and Exploration in Multimedia Processors. In: Programmable Digital Signal Processors: Architecture, Programming, and Applications. New York: Marcel Dekker Inc.
3. Celoxica Ltd. RC1000 Datasheet, 2001.
4. Darte, A. (1991). Regular Partitioning for Synthesizing Fixed-Size Systolic Arrays. Integration, the VLSI Journal 12:293–304.
5. Darte, A., Rau, B., Vivien, F., Schreiber, R. (1999). A Constructive Solution to the Juggling Problem in Systolic Array Synthesis. Technical Report 1999-15, Laboratoire de l'informatique du parallélisme.
6. De Man, H., Bolsens, I., Lin, B., Van Rompaey, K., Vercauteren, S., Verkest, D.
(1997). Hardware and Software Codesign of Digital Telecommunication Systems. Proceedings of the IEEE 85(3):391–418.
7. De Micheli, G. (2002). Network on Chip: A New Paradigm for System on Chip Design. Design Automation and Test in Europe. Paris: IEEE Computer Society Press, pp. 418–420.
8. Derrien, S., Risset, T. (2000). Interfacing Compiled FPGA Programs: the MMALPHA Approach. In: Arabnia, A., ed. PDPTA'2000: Second International Workshop on Engineering of Reconfigurable Hardware/Software Objects. CSREA Press, June.
9. Frigo, J., Gokhale, M., Lavenier, D. (2001). Evaluation of the Streams-C C-to-FPGA Compiler: An Applications Perspective. In: Ninth International Symposium on Field Programmable Gate Arrays. ACM Press, pp. 134–140.
10. Guillou, A.C., Quinton, P., Risset, T., Massicotte, D. High-Level Design of Digital Filters in Mobile Communications. DATE Design Contest 2001, March 2001. Second place, available at http://www.irisa.fr/bibli/publi/pi/2001/1405/1405.html
11. Haykin, S. (1996). Adaptive Filter Theory. 3rd ed. Prentice-Hall Information and System Sciences Series. Upper Saddle River, NJ: Prentice-Hall.
12. Katsushige, M., Kiyoshi, N., Hitoshi, K. (1999). Pipelined LMS Adaptive Filter Using a New Look-Ahead Transformation. IEEE Transactions on Circuits and Systems 46:51–55, January.
13. Kienhuis, B., Rijpkema, E., Deprettere, E.F. (2000). Compaan: Deriving Process Networks from Matlab for Embedded Signal Processing Architectures. In: 8th International Workshop on Hardware/Software Codesign (CODES 2000).
14. Lavenier, D., Quinton, P. (1996). SAMBA: Systolic Accelerator for Molecular Biological Applications. Technical Report 988, Irisa, March.
15. Le Moenner, P., Perraudeau, L., Rajopadhye, S., Risset, T. (May 1996). Generating Regular Arithmetic Circuits with AlpHard. In: Massively Parallel Computing Systems (MPCS'96).
16. Leong, P.H.W., Leong, M.P., Cheung, O.Y.H., Tung, T., Kwok, C.M., Wong, M.Y., Lee, K.H. (2001). Pilchard: A Reconfigurable Computing Platform with Memory Slot Interface.
In: Symposium on Field-Programmable Custom Computing Machines (FCCM). California: IEEE Computer Society Press.
17. Lieverse, P., Van der Wolf, P., Deprettere, E.F., Vissers, K. (2001). A Methodology for Architecture Exploration of Heterogeneous Signal Processing Systems. Journal of VLSI Signal Processing for Signal, Image and Video Technology 29(3):197–207. Special issue on SiPS'99.
18. Lin, B., Vercauteren, S. (1994). Synthesis of Concurrent System Interface Modules with Automatic Protocol Conversion Generation. In: International Conference on Computer-Aided Design (ICCAD), pp. 101–109.
19. Mauras, C. (1989). Alpha: un langage équationnel pour la conception et la programmation d'architectures parallèles synchrones. Thèse de doctorat, Ifsic, Université de Rennes 1, December.
20. Mémin, E., Risset, T. (1999). Full Alternate Jacobi Minimization and VLSI Derivation of Hardware for Motion Estimation. In: Int. Workshop on Parallel Image Processing and Analysis, IWPIPA'99. Madras, India, January.
21. Mozipo, A., Massicotte, D., Quinton, P., Risset, T. (1999). A Parallel Architecture for Adaptative Channel Equalization Based on Kalman Filter Using MMALPHA. 1999 IEEE Canadian Conference on Electrical & Computer Engineering, pp. 554–559, May.
22. Page, I. (1996). Constructing Hardware-Software Systems from a Single Description. Journal of VLSI Signal Processing 12:87–107.
23. Pimentel, A.D., Hertzberger, L.O., Lieverse, P., Van Der Wolf, P., Deprettere, E.F. (2001). Exploring Embedded-Systems Architectures with Artemis. IEEE Computer 34(11):57–63.
24. Quilleré, F., Rajopadhye, S. (2000). Optimizing Memory Usage in the Polyhedral Model. ACM Transactions on Programming Languages and Systems 22(5):773–815.
25. Quinton, P., Robert, Y. (1989). Systolic Algorithms and Architectures. Prentice Hall and Masson.
26. Schreiber, R., Aditya, S.G., Rau, B.R., Mahlke, S., Kathail, V., Cronquist, D., Sivaraman, M. (2001). PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators. Journal of VLSI Signal Processing.
27. Synplify Pro 7.0 Reference Manual, October 2001.
28.
Weiss, K., Oetker, C., Katchan, I., Steckstor, T., Rosenstiel, W. (2000). Power Estimation Approach for SRAM-based FPGAs. In: IEEE Symposium on Field Programmable Gate Arrays.
29. Wilde, D. (1993). A Library for doing Polyhed