
Practical VHDL optimization for timing critical FPGA applications

K. Kuusilinna*, T. Hamalainen, J. Saarinen

Tampere University of Technology, Signal Processing Laboratory, Hermiankatu 12 C, FIN-33720 Tampere, Finland

Received 22 July 1999; received in revised form 29 September 1999; accepted 12 October 1999

Abstract

This paper gives a hands-on example of how low-level optimization of the VHSIC Hardware Description Language (VHDL) code is extremely difficult within a contemporary Field Programmable Gate Array (FPGA) design flow. However, low-level optimization can be accomplished, and by changing the VHDL coding style synthesis results can be improved. The design flow is considered from high-level descriptions (bubble diagrams), through logic synthesis to the point where hand optimization is required. For performance benchmarking a state machine from a contemporary computer bus, PCI, implemented in a Xilinx FPGA, is used. Practical design issues applied to time-critical implementations using FPGAs, especially the trade-offs of high-level versus low-level synthesis, are analyzed. Performance evaluation results of several PCI target state machines, coded using different styles and design methods, are given in terms of time and area efficiency. Based on these findings improvements to the FPGA design methodology are proposed. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: VHDL; Field programmable gate array; Synthesis; PCI; Electronic design automation

1. Introduction

High-level hardware description languages (HDLs) like VHSIC Hardware Description Language (VHDL) [1] have become common in modern digital design. Not only are these languages able to represent designs at high abstraction levels, but considerable reductions in design time have also been observed, compared to traditional design methods. Schematic capture no longer has the ability to describe the increasingly complex designs on all relevant levels efficiently. On the down side, high-level synthesis is sometimes associated with reduced efficiency, both in chip area usage and especially in timing performance. This is largely because behavioural design entry (by definition) masks implementation details from the designer.

VHDL code optimization can be done on several levels. High-level optimization consists, for example, of partitioning, scheduling, high-level pipelining, designing logic re-use, and arithmetic optimization. Designs in a high abstraction level are usually not considered very synthesizable. However, new and better high-level design tools are becoming available all the time. So far, the logic synthesis is normally done from Register Transfer Level (RTL) code. Below that level, it is possible to do direct component mappings to those components that the technology library supports. It is self-evident for an experienced designer that the HDL coding style affects these synthesis results. The emphasis in this paper is on the technology mapping of the timing critical parts of Field Programmable Gate Array (FPGA) designs. Delay minimization in technology mapping for heterogeneous and bounded resources in FPGAs is NP-hard in the general case [3]. We show some of the design trade-offs between design effort and performance in VHDL state machine designs. In addition, we try to give a rough estimate of the kind of improvements that are attainable with extra work.

By nature, VHDL is a high-level programming language, which supports behavioural descriptions of digital systems. Designs in low abstraction levels tend to get quite awkward, since synthesizable RTL designs are practical only by restricting the code to certain known formats. Unfortunately, the final performance of a design is difficult to predict from the high-level description. One of the traditional methods of correcting this problem is to do in-place optimization to problem areas [10]. Here in-place optimization means work done to the netlist after the delay values have been back annotated. This kind of optimization is time consuming, non-portable and may not leave any documentation about the work that was necessary to get the design working. Before in-place optimization the VHDL coding style should be re-evaluated, following which the equivalent of in-place optimization should be done as far as possible from the VHDL code itself.

Microprocessors and Microsystems 23 (1999) 459–469

0141-9331/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0141-9331(99)00062-9

www.elsevier.nl/locate/micpro

* Corresponding author. Tel.: +358-3-3652111; fax: +358-3-3653095. E-mail address: [email protected] (K. Kuusilinna)

Advanced FPGAs are now comparable to many of the Application Specific Integrated Circuits (ASICs) that were designed just a few years ago, in terms of operation speed and gate count. Many applications have become attractive as FPGA designs, because the overall design time in FPGAs tends to be shorter than in ASICs. In particular, various computer interfaces like the PCI and embedded computing applications, for example for multimedia purposes [8], have become prime targets for FPGA implementations. Of course, the FPGAs are still too expensive for most mass products. Expenses also rise considerably if the FPGA chip is larger or faster than necessary. Therefore, the trade-offs made in designing the logic inside the FPGA can have a profound influence on the total costs of products. The high-level and low-level design methodologies must meet on some common ground.

The PCI Local Bus [16] serves as a good example of a contemporary design challenge. A well-defined specification describes system operation at 33 or 66 MHz. In the future PCI might use a clock as fast as 133 MHz, as suggested by the PCI-X proposal (according to preliminary information [17]). Factors such as the logical complexity of the interface, the requirement of only one chip interfacing the PCI bus per bus agent, and stringent electrical specifications clearly stretch the limits of FPGA designs.

Fig. 1. Synthesis based FPGA design flow [10,18].

In our PCI design case, Mentor Graphics Renoir version 98.3 was utilized for the high-level design entry. The registered next-state and component mapped versions of the state machine were written by hand. Synopsys FPGA compiler version 1998.02 was used for simulations and synthesis for all test designs. Xilinx Alliance version M1.4 was used for logic placement and routing. Xilinx Alliance was also used to obtain the timing and area information. This FPGA design flow is depicted in Fig. 1. Although the steps in Fig. 1 are inherently iterative, only two iterations are explicitly shown. Traditional in-place optimization is avoided as far as possible, and RTL modelling is required to deliver the necessary final performance. Correspondingly, possible area or timing problems are taken back to the RTL modelling phase if the initial modelling is not satisfactory. This rather complex tool-flow is intended to portray a generic FPGA design methodology.

The contents of this paper are as follows. Section 2 discusses the common synthesis problems found in FPGA designs. The emphasis is on the non-perfect implementation decisions that can be the cause of the performance differences between the different VHDL coding styles. Section 3 explains briefly the PCI target state machine, so that the reader can understand the design trade-offs that must be made. Section 4 describes the different coding possibilities for synthesis and the corresponding performance evaluations. Based on these findings, improvements to the design methodology, to better facilitate the use of FPGAs, are proposed in Section 5. Finally, the conclusions are given in Section 6.

2. Common synthesis problems

Many synthesis problems in FPGAs are deeply rooted in the inherent granularity of the FPGA architecture. A typical FPGA consists mainly of basic programmable blocks and the routing resources between them. For example, Xilinx call their programmable elements Configurable Logic Blocks (CLBs). Each CLB in the XC4000E family has three look-up tables (LUTs) and two flip–flops as its main components (Fig. 2 depicts three of these CLBs). The LUTs in a CLB can collectively handle nine external inputs. Thus, the granularity is relatively large, which seems to be a problem for the synthesis programs. In particular, if the fan-in of the synthesized structure just barely exceeds the logic block capacity, the relative cost of the structure increases disproportionately. A critical path through an HDL design can easily lie in several different hierarchical blocks and contain a considerable amount of logic. This is a difficult situation in all synthesis, but particularly difficult for FPGAs as the granularity may induce extra delay [10,18].

In addition, the distribution between combinational logic and flip–flops is probably inconvenient on the FPGA chip. Statistically the distribution may be in good balance, but it is virtually impossible to implement a specific design in a way that does not waste the CLB resources to some extent. Synthesizable elements of roughly equal logical complexity may be significantly different in terms of area and operational speed, when finally implemented on an FPGA. This is also essentially due to the suitability of an individual structure to the logic block architecture.

Routing delays may also become a notable problem. It is sometimes possible to model the delay between logic blocks as a constant. This is usually a reasonable approach for FPGAs with global channel routing, where all routing lines are of roughly equal length. In some architectures, the routing lengths can vary considerably and so the time delays vary. Xilinx XC4000E FPGAs have an asymmetric channel routing architecture and they exhibit these changing timing delays [7]. These delays are difficult to predict before the actual logic placement and routing are done, though some promising results have been reported [20]. Routability and performance have been found to correlate with the net length distribution and the location of the nets across the channels [19]. In addition, fan-out affects the delays in signal lines. In CMOS logic, heavily loaded signal lines are slow.

2.1. Optimization issues with VHDL and Xilinx FPGAs

Some of the driving forces to adopt VHDL design entry are independence from target technology, designer productivity and the possibility to design in a high abstraction level. These are all good points, but lead to somewhat awkward resource mapping, if such a procedure is necessary to improve performance. To alleviate this and other problems encountered in practical circuit design, many VHDL Initiative Towards ASIC Libraries (VITAL) [6] models and lately Intellectual Property (IP) blocks and other less generic libraries have been introduced. These libraries reduce the design effort required for designs in lower abstraction levels.

Fig. 2. Successful synthesis result and logic mapping to CLBs. (FF denotes flip–flop, and FMAP, GMAP and HMAP are look-up tables).

Another method to customize VHDL to some specific purpose is to use synthesis scripts. These scripts transform designer knowledge into additional guidance to the synthesis, in order to achieve better synthesis results. The downside of this approach is that it is not very portable between different target technologies. On the positive side, the optimization work is documented in the script.

It is possible for a VHDL code to behave differently in a simulation than in the final physical implementation after the synthesis [5]. This behaviour becomes increasingly emphasized in high abstraction level designs. Therefore, designers are usually encouraged to restrict themselves to proven VHDL structures and to maintain a consistent coding style.

A very successful synthesis result is depicted in Fig. 2. The design consists of two edge-triggered flip–flops, with three look-up tables worth of combinational logic between them. These area parameters are assumed constant in the following less successful technology mappings. In this example, the logic packing to CLBs has succeeded very well and the routing between CLBs is short. For the purposes of these examples, this mapping is assumed “ideal.”

As shown in Fig. 2 it is a good idea for the outputs of sub-designs to be registered. This simplifies the timing budget design and especially in FPGAs the registers are considered better capable of driving the subsequent logic. Fig. 3 depicts a situation where a sub-optimal VHDL model was used to describe the design. Some combinational logic is placed after the final flip–flop and the propagation time through this look-up table must be subtracted from the timing budget of the next stage. This problem is particularly typical for some state machine encodings.
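The difference between a decoded output (the Fig. 3 situation) and a registered output can be sketched in VHDL as follows. This is only an illustration under stated assumptions: the entity `output_style`, the input `go`, the four-state machine and both output names are hypothetical and are not taken from the PCI design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: one output produced in two coding styles.
entity output_style is
  port (
    clk, rst  : in  std_logic;  -- rst: asynchronous, active high (assumption)
    go        : in  std_logic;  -- hypothetical input that starts the sequence
    busy_comb : out std_logic;  -- decoded AFTER the state flip-flops
    busy_reg  : out std_logic   -- driven directly by its own flip-flop
  );
end entity output_style;

architecture rtl of output_style is
  signal state, next_state : std_logic_vector(1 downto 0);
begin
  -- Next-state logic of a hypothetical binary-encoded four-state machine.
  next_state <= "01" when state = "00" and go = '1' else
                "10" when state = "01" else
                "11" when state = "10" else
                "00" when state = "11" else
                state;

  -- Style 1: the decode sits after the registers, so its look-up table
  -- delay is charged to the timing budget of the next stage.
  busy_comb <= '1' when state = "01" or state = "10" else '0';

  regs : process (clk, rst)
  begin
    if rst = '1' then
      state    <= "00";
      busy_reg <= '0';
    elsif rising_edge(clk) then
      state <= next_state;
      -- Style 2: the same decode is applied to next_state, in front of
      -- a dedicated output register.
      if next_state = "01" or next_state = "10" then
        busy_reg <= '1';
      else
        busy_reg <= '0';
      end if;
    end if;
  end process regs;
end architecture rtl;
```

Both outputs carry the same value in every clock cycle; the second style spends one extra flip–flop so that the output leaves the block directly from a register.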

The third case in Fig. 4 depicts a situation where logic mapping to CLBs has not been very successful. For some reason there is combinational logic placed in the look-up tables of CLB2. The HDL code should be changed to better reflect the CLB structure, or the logic should be forced into CLB3.

Fig. 5 depicts how the signal path can accumulate unnecessary logic levels. The two LUTs that were situated parallel to each other in the previous cases are now situated in series. This naturally induces unwanted delay to the signal path. As in the previous case, a CLB and some routing resources could be saved by having all the combinational logic in CLB3.

Fig. 6 depicts a routing problem. It is very difficult to influence routing directly from the VHDL. Therefore, it is advisable to keep the logical structure of the design as clear as possible. If the first flip–flop had been in CLB2 instead of CLB1 the router might have “seen” the obvious solution, and not suggested the long route. It is, of course, perfectly possible for the routing resources to be in short supply, which sometimes causes awkward routing. Sometimes hand-made in-place optimization is the only method that can fix routing problems.


Fig. 3. Poorly mapped logic after second flip–flop. (FF denotes flip–flop, and FMAP, GMAP and HMAP are look-up tables).

Fig. 4. Poorly mapped logic after first flip–flop. (FF denotes flip–flop, and FMAP, GMAP and HMAP are look-up tables).


The synthesis software may not be aware of all the resources available in the target FPGA chip. This may result in inefficient use of the FPGA. For example, the Xilinx XC4000E family has structures called Wide Edge Decoders, which perform wired-AND operations on multiple input signal lines. This structure is available from Synopsys synthesis only by explicitly specifying such components. A more common case is the Xilinx STARTUP or STARTBUF components, which are compulsory in practically all designs as they are used to specify the global clock and reset.

3. Case study: PCI target state machine

Fig. 7 is a bubble diagram of the PCI bus target state machine that is used as an example. This PCI state machine was chosen because it is small enough to be described here in detail and large enough that the synthesis software does not necessarily achieve the optimal result from all design descriptions. Some of the state transitions are assigned to happen on signals that are not external inputs to the chip. These signals are aggregates of such external inputs. This is done to clarify the transitions and does not affect the underlying principles that this paper presents.

The main function of the PCI target state machine is to track the following operations. The target should wait for a transaction to begin on the bus. Those transactions that are not intended for this particular agent should be gracefully disregarded. Every PCI bus agent has a Configuration Memory; read and write accesses to these addresses must be allowed. For actual data transfer the PCI target must also be able to respond to I/O or Memory mapped reads and writes. The state machine in the example responds to the I/O accesses.

In the IDLE state of Fig. 7, the state machine waits for a bus transaction to begin. This is indicated by asserting the frameX signal. If the bus command is such that the target may not, under any condition, respond to it, or it is of an inappropriate format for a Configuration Memory access, then state transition is made to the Bus Busy (B_BUSY) state. B_BUSY denotes that the bus is busy with a transaction to another agent, and the state machine returns to the IDLE state as soon as the frameX is deasserted.

The ADDRESS state allows the target to recognize its address from the bus, and proceed to Send Data (S_DATA), if the target is ready to handle the transaction. The S_DATA state is held until the current access completes. In the event that the address does not match or the agent is logically removed from the bus the state transition is to the B_BUSY state. The BACKOFF state is reached, if the target, for some reason, needs to abort the current transaction.

The CONFIG state is reserved for read and write accesses to the Configuration Memory. The end of data transfer signals a transition to the Turn Around (TURN_AR) state. The TURN_AR state is responsible for driving certain external signals to unasserted values before they are driven to a high-impedance state. The next state after TURN_AR is always IDLE.

Fig. 5. Logic depth has risen to three look-up tables. (FF denotes flip–flop, and FMAP, GMAP and HMAP are look-up tables).

Fig. 6. Inefficient routing. (FF denotes flip–flop, and FMAP, GMAP and HMAP are look-up tables).

The state machine controls and times all the other activities in the design. Therefore, it is essential that the state machine functions reliably, without errors. In a 33 MHz PCI bus, from the time that the external inputs are sampled into registers there are 15 ns before the state machine must change its state. This 15 ns interval must encompass the set-up and hold times of the flip–flops, routing delays and the delay through all the combinational logic in the signal path. For error free operation in all possible situations, this requires special care for state machine design in many FPGAs.
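The 15 ns budget described above can be written as a single inequality. This is only a sketch: the symbols are introduced here for illustration and do not appear in the original text, and the individual delay values depend on the device, the speed grade, and the result of placement and routing.

```latex
\[
  t_{\mathrm{ff}} + t_{\mathrm{logic}} + t_{\mathrm{route}} \;\le\; 15\ \mathrm{ns}
\]
```

Here $t_{\mathrm{ff}}$ stands for the flip–flop set-up and hold allowance, $t_{\mathrm{logic}}$ for the total delay through the look-up tables on the path, and $t_{\mathrm{route}}$ for the routing delay between the logic blocks. Every additional look-up table level or long route on the path increases the left-hand side.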

Table 1 presents different approaches to implement the state machine in Fig. 7 and their performance records. In the table “Binary” refers to Mentor Graphics Renoir generated binary encoded state machine. “One-hot” correspondingly is Renoir generated and one-hot encoded. “Registered Next-state” is RTL level code, where all the logic concerning a state transition is collected to one equation [22]. This type of code is depicted in Fig. 8. “CLB Mapped” refers to the code that is composed of component mappings and all components are explicitly mapped to CLBs (see Fig. 9). All codes are in VHDL, the last two were created by hand, and the others by Renoir. In the table, “Synopsys” values are estimates before placement and routing, “Xilinx” values are the final results. In addition, percentages of CLB decrease and speed increase from the Binary baseline are given. Code lengths are compared in the final column.
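As a worked check of the two percentage columns, the figures can be reproduced from the final Xilinx results, taking the Binary implementation (18 CLBs, 37 MHz) as the baseline. That the percentages are computed from the Xilinx columns is an assumption, although it is consistent with the numbers in Table 1; the symbols $N$ and $f$ are introduced here for illustration.

```latex
\[
  \text{CLB decrease} = \frac{N_{\text{Binary}} - N}{N_{\text{Binary}}} \times 100\%,
  \qquad
  \text{Speed increase} = \frac{f - f_{\text{Binary}}}{f_{\text{Binary}}} \times 100\%
\]
\[
  \text{One-hot:}\quad \frac{18 - 9}{18} = 50\%, \qquad \frac{46 - 37}{37} \approx 24\%
\]
\[
  \text{CLB Mapped:}\quad \frac{18 - 7}{18} \approx 61\%, \qquad \frac{82 - 37}{37} \approx 122\%
\]
```

For the Registered Next-state row the same formula gives $(62 - 37)/37 \approx 67.6\%$, which the table lists as 67.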

In Fig. 8 the code declares two register assignments and the combinational logic preceding these registers. Thus, the signals “IDLE” and “ADDRESS” come from their respective registers and the combinational logic is gathered before the registers. Logically, the code in Fig. 9 corresponds to the code in Fig. 8, only logic mapping information has been added.
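The registered next-state style can be illustrated with a minimal VHDL sketch. It is not the code of Fig. 8: the three-state machine below, the entity name `reg_next_state`, the active-high input `frame` and all transition conditions are simplifying assumptions made for illustration. Only the coding pattern, one flip–flop per state with a single equation collecting all the logic in front of it, follows the description above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical three-state excerpt in registered next-state style.
entity reg_next_state is
  port (
    clk     : in  std_logic;
    rst     : in  std_logic;  -- asynchronous reset, active high (assumption)
    frame   : in  std_logic;  -- hypothetical active-high, sampled form of frameX
    idle    : out std_logic;
    address : out std_logic;
    b_busy  : out std_logic
  );
end entity reg_next_state;

architecture rtl of reg_next_state is
  signal idle_q, address_q, b_busy_q : std_logic;
begin
  state_regs : process (clk, rst)
  begin
    if rst = '1' then
      idle_q    <= '1';  -- IDLE register is preset at reset
      address_q <= '0';  -- the other state registers are cleared
      b_busy_q  <= '0';
    elsif rising_edge(clk) then
      -- One equation per state register; every term that can set the
      -- register is visible in one place.
      idle_q    <= (idle_q and not frame) or (b_busy_q and not frame);
      address_q <= idle_q and frame;
      b_busy_q  <= address_q or (b_busy_q and frame);
    end if;
  end process state_regs;

  -- The state outputs come straight from the registers.
  idle    <= idle_q;
  address <= address_q;
  b_busy  <= b_busy_q;
end architecture rtl;
```

In this sketch each equation has at most three inputs, so the number of signals feeding each register, and hence the likely look-up table depth, can be read directly from the source.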

Fig. 7. PCI target state machine as a Mentor Renoir design entry.

Table 1. Performance of some PCI target state machine implementations

                       Synopsys    Xilinx  Synopsys maximum  Xilinx maximum  CLB decrease  Speed increase  Lines of
                       CLBs        CLBs    speed (estimate)  speed (MHz)     from maximum  from minimum    VHDL code
                       (estimate)          (MHz)                             (%)           (%)
Binary                 14          18      41                37              0             0               218
One-hot                9           9       35                46              50            24              191
Registered Next-state  9           7       34                62              61            67              118
CLB Mapped             21          7       24                82              61            122             414

Fig. 10 depicts the final result of the code in Fig. 9. Component instances (FDP, FDC, OR2, AND2, AND3 and INV), look-up tables (FMAP and GMAP) and all the signals are depicted in the figure. First, the code in Fig. 9 declares the components and their signal mappings. Then the FMAP_PUC component specifies that all the combinational logic of the IDLE state is collected in a single CLB. In the Synopsys specific part (inside the pragma statements), the synthesis software is told not to make any changes to this implementation. Then both the flip–flop and the combinational logic are locked into the same CLB with the “xnf_blknm” (block name) attribute. Finally, the CLB in question is locked to row 2 column 2 in the FPGA with the “xnf_loc” attribute. The ADDRESS state mappings are similar, but no xnf_loc attribute is required, as this mapping was done in the code for the IDLE state.

The space of guidance attributes that can be given to synthesis is vast. Summed together, Xilinx accepts approximately 50 constraints, like the “loc” above, from various design levels. Likewise, Synopsys can be controlled through tens, probably hundreds of attributes, variables and constraints. Of course, not all of the parameters are viable for a particular design.

4. Synthesis results for the PCI state machine

Mentor Graphics Renoir produced the binary and one-hot encoded state machine VHDL codes from the corresponding bubble diagrams. As mentioned earlier, the registered next-state and component mapped versions of the state machine were written by hand. Synopsys synthesis was set to minimize area, where applicable. Xilinx Alliance was used for logic placement and routing to Xilinx XC4005E-3 FPGA. The timing and area information from the Alliance was used to compare these designs.

The binary encoding clearly has the poorest performance of all the codes, as seen from Table 1. It is heavily penalized by the FPGA granularity, which has too low a ratio of combinational logic to registers for this type of encoding. In addition, the requirement for the state machine to output the current state in a one-hot fashion is not very suitable for binary encoding. These problems correspond especially to the case in Fig. 3 from Section 2. However, the performance of the binary encoding is used as a baseline to compare the other implementations. The speed of 37 MHz should barely suffice for a 33 MHz PCI implementation, but requiring 18 CLBs is definitely excessive.

The one-hot encoding achieves 50% savings in area compared to the binary implementation. Speed is only improved by 24% to 46 MHz. Clearly, this encoding is more suitable for the state machine. Unfortunately, the speed increase is only moderate. As there are still two more CLBs than absolutely necessary, some signals have too many look-up tables in their paths. This problem is analogous to those in Figs. 4 and 5. The code length is approximately equal to the code length of the binary encoding.
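The two encodings compared above can be contrasted in a small VHDL package. The state names are the seven states described in Section 3; the package name, the type names and the actual bit patterns are assumptions chosen for illustration, since the encodings generated by Renoir are not given in the text.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical encodings for the seven states named in Section 3.
package pci_target_encodings is

  -- Binary encoding: seven states fit in three flip-flops, but every
  -- "current state" output needs a decode after the registers.
  subtype bin_state_t is std_logic_vector(2 downto 0);
  constant BIN_IDLE    : bin_state_t := "000";
  constant BIN_B_BUSY  : bin_state_t := "001";
  constant BIN_ADDRESS : bin_state_t := "010";
  constant BIN_S_DATA  : bin_state_t := "011";
  constant BIN_BACKOFF : bin_state_t := "100";
  constant BIN_CONFIG  : bin_state_t := "101";
  constant BIN_TURN_AR : bin_state_t := "110";

  -- One-hot encoding: one flip-flop per state, so each "current state"
  -- output is simply one bit of the state register.
  subtype hot_state_t is std_logic_vector(6 downto 0);
  constant HOT_IDLE    : hot_state_t := "0000001";
  constant HOT_B_BUSY  : hot_state_t := "0000010";
  constant HOT_ADDRESS : hot_state_t := "0000100";
  constant HOT_S_DATA  : hot_state_t := "0001000";
  constant HOT_BACKOFF : hot_state_t := "0010000";
  constant HOT_CONFIG  : hot_state_t := "0100000";
  constant HOT_TURN_AR : hot_state_t := "1000000";

end package pci_target_encodings;
```

With the binary constants an "in IDLE" indicator is a three-input comparison placed after the state registers, which is the Fig. 3 situation. With the one-hot constants the same indicator is bit 0 of the register, at the cost of four more flip–flops.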

Fig. 8. Registered next-state encoded IDLE and ADDRESS states.

Fig. 9. CLB mapped IDLE and ADDRESS states.

In the registered next-state encoding, the area is decreased by only 22% from the one-hot. This is due to the fact that the area has now reached its optimal size of 7 in terms of CLBs. Registered next-state coding style explicitly specifies the state registers and conveniently lumps together the logic that should physically belong together. The problem that this encoding attempts to solve is similar to the problem in Fig. 4. Therefore, the speed is also improved by 35% to 62 MHz. In terms of code length, this is the most compact representation. The code is, however, less readable and requires more concentration on design than the previous examples.

With the CLB mapped code it is possible to achieve the best performance of all the examples presented here. The area does not decrease any further as it is already at its minimum. Area decrease, however, may very well be possible in other situations. The speed performance, on the contrary, can still be improved. The problems that are solved by this encoding are largely related to those in Figs. 4 and 6. For the purposes of these examples, the speed was further increased by 32% to 82 MHz, by rearranging the CLBs. It is extremely difficult to say when the speed has reached its absolute maximum. In terms of code length, this coding is the most verbose. This code takes considerable time to generate, but the coding process is quite easy and straightforward if the results and the code from the registered next-state case are available.

The overall logic placement for the CLB mapped code is shown in Fig. 11, where all the state registers and utilized look-up tables are depicted as shaded boxes. Only one CLB contains just combinational logic, all other combinational logic has been successfully merged into blocks, which include their corresponding flip–flops. Not explicitly shown in Fig. 11, but analysing the original data reveals that the maximum logic depth is three look-up tables.

In addition, Table 1 has Synopsys estimates for CLB usage and operating speed for all the test cases. These values are directly derived from the Synopsys timing report, and no effort was made to seek the actual critical paths. These estimates can be used to roughly compare how well the synthesis software understands the corresponding VHDL encoding. From this case, it could be concluded that Synopsys understands the conventional VHDL quite well, but is incapable of predicting the performance increases in the lower-level codes. One explanation, especially in the CLB mapped case, might be the Xilinx FMAP and HMAP components, which the software in this case interprets poorly.

Dramatic performance improvements are available in FPGA designs by adopting suitable design practices. Fig. 12 depicts the design approaches from this paper in a descending order of designer involvement. Design projects should always balance the effort versus performance to suit their particular needs. In Fig. 12 the “Behavioural Description” can be understood to refer to the bubble diagram in Fig. 7 and the binary encoded VHDL code derived from it. “Specific State Machine Encoding”, “Registered Next-state Equations” and “Logic Mapping to CLBs” correspond to the one-hot encoded, registered next-state encoded and CLB mapped codes, respectively. “CLB Placement” refines the CLB mapping and refers to the “xnf_loc” and “xnf_blknm” attributes seen in Fig. 9. “In-Place Routing” has not been discussed in this paper, but it is the next logical step if the routing still requires refinement.

5. Changes to FPGA design flow for timing critical parts

Fig. 10. Logic of IDLE and ADDRESS states mapped into a CLB.

It is clear that the software used for this project does not support explicit technology mapping to FPGAs very well. It is also difficult to distinguish the efficient high-level constructs from the inefficient ones. Performance, however, is very dependent on successful mappings, as can be seen from Table 1. Though beyond the scope of this paper to prove, this also seems to be the general state of affairs with other high-level FPGA software. The aspects that seem to have the most adverse effects on FPGA design flow are described below, along with suggestions as to how they should be avoided.

Documentation should also address the hand optimization issues. For some time to come, there will be situations where hand optimization is absolutely necessary. These situations include fitting the design into the smallest FPGA possible for reasons of cost, changes to existing designs without changing the hardware, IP and library components, and hard-to-synthesize special structures. It is unreasonable that the design engineer should re-invent and reverse engineer the low-level workings of the device and the software to perform these duties.

The software does not give the designer information, which could be used to avoid hand optimization. The normal way to describe state machines in VHDL is ill suited for FPGA designs. It requires exceptional vision to see how deep the combinational logic is going to be. It is also not explicitly clear which signals contribute to which parts of the combinational logic. It is, therefore, difficult to see the critical signals whose optimization would be the most fruitful. Thus, the EDA software should offer a conversion to registered next-state logic. In this form, the combinational logic is clearly encapsulated before the register element, thus, its depth and correlations can be easily analyzed. As an additional benefit, the VHDL synthesis software has little leeway to implement this logic, therefore the results are predictable. The readability of the design decreases, as implementation issues are incorporated into the design. Therefore, in addition to the conversion between different representations, the software should support both syntax and logical checks. These checks are to ensure that the different representations are logically equivalent and contain no unexpected behaviour.

General synthesis software typically has a poor concept of the logic element (CLB in Xilinx FPGAs). Therefore, the synthesis is mainly guided by timing parameters. Signal timing is derived from delays through discrete components. This describes the look-up table based FPGAs poorly. If the concept of logic elements were to be better incorporated into software, the synthesis could be additionally guided by the depth of logic in logic elements. We have found this method to produce very robust designs, with respect to their timing performance. Some special FPGA synthesis software already has this property. Combinational and sequential synthesis to FPGAs have been discussed in Refs. [2,14,15], respectively.

Mixed design entry methods are still rather poorly implemented in most design flows. This is, however, the case where we see the most promising developments from the EDA vendors. It is understood that just one design entry method is not optimal for all designs, and the designer needs the ability to intervene in the synthesis process [21].

6. Conclusions

Fig. 11. Final state machine logic placement for the CLB mapped code. (FF denotes flip–flop, and FMAP, GMAP and HMAP are look-up tables).

Hardware description languages are labour saving devices. Sometimes the initial synthesis results from HDLs and the savings in design effort are not what one would like them to be. This, however, does not necessarily mean that HDLs could not be used for this particular purpose. Sometimes the better performance just needs extra work.

In this paper, we have shown that it is possible to follow a sliding scale of design effort still within the hardware description language, without escaping to schematic capture, in-place optimization tools or to any other external design method. All VHDL designs can benefit from well-thought design structure and encoding. FPGA designs can especially profit from logic optimization to the target technology. Area decrease to less than half and speed increase to more than double from the initial naive coding were observed. Unfortunately, the final performance increases lie in the details, which steeply raise design time and perceived complexity. Further, the increasing number of small details makes hand optimization an error-prone method. Therefore, the type of design style presented here is not practical, or meant, for large designs, but perhaps for the small timing critical parts of these designs. We have also attempted to indicate the limits of even the most careful technology mapping; usually it is more productive to enhance the behavioural description than to resort to hand optimization. Some of the benefits and disadvantages of this design method have been tabulated in Table 2.

The concepts in this paper were tested in a PCI bus interface design [11]. The target FPGA chip, a Xilinx XC4005E-3, is large enough to contain a PCI interface and the back-end control logic if the ideas presented here are applied. In particular, registered next-state coding was observed to achieve large area savings. In addition, some critical signal timings required special attention, and direct CLB mapping was used for the logic that generated these signals.
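As an illustration only, the following minimal VHDL sketch shows one possible reading of registered next-state coding: every state flip-flop is assigned inside a clocked process by its own small Boolean equation. It is not the PCI target state machine of this work. The entity name onehot_fsm_sketch, the three states and the ports start, ready and busy are hypothetical, and one-hot encoding with a synchronous, active-high reset is assumed.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical three-state controller (idle, busy, done).
-- All names are illustrative assumptions, not taken from the PCI design.
entity onehot_fsm_sketch is
  port (
    clk   : in  std_logic;
    rst   : in  std_logic;  -- synchronous, active-high reset (assumption)
    start : in  std_logic;  -- request to leave the idle state
    ready : in  std_logic;  -- indicates that the busy phase may end
    busy  : out std_logic
  );
end entity onehot_fsm_sketch;

architecture rtl of onehot_fsm_sketch is
  -- One flip-flop per state (one-hot encoding).
  signal s_idle, s_busy, s_done : std_logic;
begin

  state_regs : process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        s_idle <= '1';
        s_busy <= '0';
        s_done <= '0';
      else
        -- Each next-state value is written as an explicit equation of the
        -- current state bits and the inputs, so the logic in front of each
        -- flip-flop is visible directly in the source.
        s_idle <= (s_idle and not start) or s_done;
        s_busy <= (s_idle and start) or (s_busy and not ready);
        s_done <= s_busy and ready;
      end if;
    end if;
  end process state_regs;

  -- The output is taken straight from a state flip-flop.
  busy <= s_busy;

end architecture rtl;
```

In this sketch each equation depends on at most three signals, so each would be expected to fit a single four-input look-up table in front of its flip-flop. Writing the equations out this way makes the number of inputs per flip-flop easy to inspect before synthesis, at the cost of a description that is harder to read than a case statement over an enumerated state type.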

VHDL design software was found to be lacking in support for low-level design entry. To achieve better results from high-level designs, the software should offer pre-synthesis estimates of design performance. One possible tool is registered next-state coding, which can better indicate the performance bottlenecks in an FPGA design. Now that the problem areas have been identified, future research should concentrate on finding ways to characterise designs and to extract useful pre-synthesis information. Presenting this information to the designer in the most efficient manner is another open question.

Fig. 12. Relative effort required for various design practices.

Table 2. Advantages and disadvantages of explicit technology mapping from VHDL

| Advantages | Disadvantages |
| --- | --- |
| Component instantiation is unambiguous | Describes implementation rather than logic |
| Detailed description of timing or area critical | Prone to design errors |
| Forced logic replication [15] | Perceived complexity is high |
| Forced logic retiming [15] | More lines of VHDL code |
| Component that cannot be synthesized | Slow design process |
| Special structures: Low-power designs [9] | |
| Suitable for incremental [12] (bottom-up [13]) design | Simulations can become slow [4] |
| Synthesis is easy and fast | Applicable only to relatively small designs or sub-sections of larger designs |

Acknowledgements

This research work has been supported by the Tampere Graduate School in Information Science and Engineering (TISE), The Nokia Foundation, and The Ulla Tuominen Foundation.

References

[1] ANSI/IEEE Std 1076-1993, IEEE Standard VHDL Language Reference Manual, The IEEE Inc., 1994.

[2] J. Cong, Y. Ding, Combinational logic synthesis for LUT based field programmable gate arrays, ACM Transactions on Design Automation of Electronic Systems 1 (2) (1996) 145–204.

[3] J. Cong, S. Xu, Delay-oriented technology mapping for heterogeneous FPGAs with bounded resources, IEEE/ACM International Conference on Computer-Aided Design, 8–12 November 1998, San Jose, California, pp. 40–45.

[4] G. Doncev, M. Leeser, S. Tarafdar, Truly rapid prototyping requires high level synthesis, Ninth International Workshop on Rapid System Prototyping, 3–5 June 1998, Leuven, Belgium, pp. 101–106.

[5] H. Howe, Pre- and postsynthesis simulation mismatches, IEEE International Verilog HDL Conference, 31 March–3 April 1997, Santa Clara, California, pp. 24–31.

[6] IEEE Std 1076.4-1995, IEEE Standard for VITAL Application-Specific Integrated Circuit (ASIC) Modeling Specification, The IEEE Inc., 1996.

[7] J. Isoaho, DSP system development and optimization with field programmable gate arrays (Doctor of Technology Dissertation), Tampere University of Technology, Tampere, 1994.

[8] J. Knuutila, T. Leskinen, System Requirements of Wireless Terminals for Future Multimedia Applications, Advances in Information Technologies: the Business Challenge, IOS Press, 1997.

[9] M. Koegst, G. Franke, S. Rülke, K. Feske, Multi-criterial state assignment for low power FSM design, 24th EUROMICRO Conference, 25–27 August 1998, Västerås, Sweden, pp. 261–268.

[10] P. Kurup, T. Abbasi, Logic Synthesis Using Synopsys®, Kluwer Academic Publishers, Boston, 1995.

[11] K. Kuusilinna, T. Hamalainen, J. Saarinen, Field programmable gate array-based PCI interface for a coprocessor system, Microprocessors and Microsystems 22 (1999) 373–388.

[12] M. Lehky, S. Bilik, Reducing FPGA design modification time, VHDL International Users’ Forum, 19–22 October 1997, Arlington, Virginia, pp. 143–149.

[13] O. Mencer, M. Morf, M. Flynn, PAM-Blox: high performance FPGA design for adaptive computing, IEEE Symposium on FPGAs for Custom Computing Machines, 15–17 April 1998, Napa Valley, California, pp. 167–174.

[14] R. Murgai, R. Brayton, A. Sangiovanni-Vincentelli, Sequential synthesis for table look up PGA’s, Euro ASIC’92, 1–5 June 1992, pp. 32–37.

[15] P. Pan, C. Liu, Optimal clock period FPGA technology mapping for sequential circuits, ACM Transactions on Design Automation of Electronic Systems 3 (3) (1998) 437–462.

[16] PCI Local Bus Specification, Revision 2.1, PCI Special Interest Group, USA, 1995.

[17] PCI-X Addendum 1.0 Press Release, PCI Special Interest Group, USA, 1999 (http://www.pcisig.com/newsroom/1999/pcix_review.pdf, last accessed 07/16/1999).

[18] The Programmable Logic Data Book, Xilinx Inc., 1996.

[19] K. Roy, S. Nag, Automatic synthesis of FPGA channel architecture for routability and performance, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2 (4) (1994) 508–511.

[20] M. Schlag, P. Chan, J. Kong, Empirical evaluation of multilevel logic minimization tools for a lookup-table-based field-programmable gate array technology, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 12 (5) (1993) 713–722.

[21] H. Schmit, L. Arnstein, D. Thomas, E. Lagnese, Behavioral synthesis for FPGA-based computing, IEEE Workshop on FPGAs for Custom Computing Machines, 10–13 April 1994, Napa Valley, California, pp. 125–132.

[22] S. Wasson, High-speed state machine design, Integrated System Design Magazine July (1995).


Kimmo Kuusilinna received his MSc degree in Electronics in 1996 from the Tampere University of Technology, Finland. He has been working as a research scientist since 1996 at Tampere University of Technology. His main research interests are interconnection networks and communication protocols for multiprocessor systems. Currently he is working towards his Dr Tech degree in a project for on-chip bus solutions.

Timo Hamalainen was born in Finland in 1968. He studied analogue and digital electronics, computer architecture and power electronics in the Electrical Engineering Department at Tampere University of Technology (TUT), where he received his MSc degree in 1993. His doctoral research concerned parallel processing of adaptive, intelligent algorithms in a special multiprocessor computer. He received his PhD degree in January 1997. Currently he is working in the TUT Signal Processing Laboratory as a senior research scientist. His areas of interest are computer architectures and wireless terminals for multimedia systems.

Jukka Saarinen was born in Finland in 1961. He studied computer architecture, digital techniques, telecommunications and software engineering in the Electrical Engineering Department at Tampere University of Technology, where he received his MSc degree in 1986, Licentiate in Technology degree in 1989, and Doctor of Technology degree in 1991. Currently he is the Professor of the Computer Engineering Laboratory in the Signal Processing Laboratory at Tampere University of Technology. His research interests are parallel processing, DSP architectures, neural networks and hardware for multimedia systems.