
Performance Optimization by Placement Constraints for FPGA-based Asynchronous Processors

Jukiya Furushima, Tatsuki Otake, and Hiroshi Saito
The University of Aizu, Japan


Abstract: In this paper, we propose a performance optimization method for Field Programmable Gate Array (FPGA)-based asynchronous processors using placement constraints. The proposed method generates placement constraints in two steps. The first step uses floorplanning to reduce path delays among resources. The second step reduces control-path delays inside control modules and delay elements by considering the structure of the FPGA. The generated placement constraints not only optimize performance but also reduce the number of delay adjustments by stabilizing delay variation. In the experiment, we design asynchronous MIPS processors using the proposed method. The experimental results show that the proposed method optimizes performance with fewer delay adjustments.

I. Introduction

The need to implement processors on Field Programmable Gate Arrays (FPGAs) is increasing in order to realize an embedded system on a single FPGA. This may increase system performance and reduce system cost. In contrast, the power consumption of FPGAs may increase because both a processor and an accelerator exist on the same device.

Commercial FPGAs target synchronous circuits, in which circuit components are controlled by global clock signals. However, the power consumption of synchronous circuits increases when the clock signals are distributed over a wide area at high frequency.

Asynchronous circuits potentially consume less power than their synchronous counterparts because local handshake signals, instead of global clock signals, are used to control circuit components. However, implementing asynchronous processors on commercial FPGAs is difficult because the design environment for commercial FPGAs assumes synchronous circuit designs. It does not support the constraint generation, timing verification, and so on required for asynchronous circuits.

Several studies have implemented asynchronous processors on commercial FPGAs. Minas et al. proposed an asynchronous RISC-based processor with both online and offline testing capabilities in [1]. Chen et al. proposed a pipelined asynchronous 8051 processor in [2]. It was designed using Balsa [3], which allowed different handshake protocols to be realized for the processor. Lee et al. proposed an asynchronous FPGA processor for low-power sensor networks in [4]. Compared to the synchronous counterpart, the battery life could be extended 60 times when the sleep time was assumed to be 99% of the mission time. Herrera et al. implemented an asynchronous 8-bit processor on an FPGA in [5]. The architecture is simple and easy to understand. These studies focused on the low-power characteristics or testability of asynchronous circuits on FPGAs. However, they did not address how to design asynchronous processors on commercial FPGAs under a cycle time constraint so that a timing requirement is satisfied.

In [6], a design method to implement asynchronous processors on commercial FPGAs was proposed. Asynchronous processors were designed based on the maximum delay constraints generated from a given cycle time constraint. However, the method did not address the delay variation caused by the delay adjustment and re-synthesis performed to satisfy timing constraints. As a result, the performance of a processor is unpredictable until all timing constraints are satisfied.

In this paper, we propose a performance optimization method for FPGA-based asynchronous processors using placement constraints. The generated placement constraints reduce both control-path and data-path delays. They also stabilize the delay variation before and after delay adjustment and re-synthesis, which reduces the number of delay adjustments.

The proposed method generates placement constraints in two steps. The first step generates placement constraints for the resources used in asynchronous processors through floorplanning to reduce the path delays among resources. We consider two approaches. The second approach is similar to [7] in that it considers the characteristics of bundled-data implementation for floorplanning. However, it differs from [7] in that it considers the structure of pipelined processors. The second step generates placement constraints for the logic of control modules and delay elements to reduce control-path delays, considering the structure of the target FPGA.

The rest of this paper is organized as follows. In Section II, we describe asynchronous circuits with the bundled-data implementation used in this paper. In Section III, we describe the proposed method. In Section IV, we describe the experimental results. Finally, in Section V, we conclude this paper with future work.

R4-18 SASIMI 2018 Proceedings

Fig. 1. Circuit model of bundled-data implementation.

II. Asynchronous Circuits with Bundled-data Implementation

Bundled-data implementation is one of the data encoding schemes used in asynchronous circuits. In bundled-data implementation, N-bit data are represented by N + 2 signals. The two additional signals are the request signal req and the acknowledge signal ack. The completion of operations in bundled-data implementation is guaranteed by a delay element on the req signal. Therefore, the performance of bundled-data implementation depends on the delay of the control circuit, including delay elements.

Figure 1 represents the circuit model of bundled-data implementation used in this paper. It consists of a data-path circuit and a control circuit.

The data-path circuit is almost the same as the one used in synchronous processors. It consists of a program counter (PC), instruction memory (IMEM), register file (RF), decoder, arithmetic logic unit (ALU), data memory (DMEM), and pipeline registers (pipereg). Delay elements wd_{i_j,k} and hd_k are inserted to satisfy the simultaneous register/memory writing constraints and the hold constraints for register k [6]. Here, i denotes pipeline stage i (0 ≤ i ≤ m − 1) and j (0 or 1) denotes one of the two control modules that control pipeline stage i.

The control circuit consists of control modules ctrl_{i_j}. A control module ctrl_{i_j} consists of a Q-module q_{i_j} [8], delay elements sd_{i_j}, bd_{i_j}, and cd_{i_j}, and a C-element c_{i_j} [9]. Q-modules are internally clocked modules that can be used to satisfy delay-insensitive specifications. A C-element is a circuit that synchronizes its input signals. Delay elements sd_{i_j}, bd_{i_j}, and cd_{i_j} are inserted to satisfy setup constraints for registers, branch constraints for control branches, and control initialization constraints for control modules ctrl_{i_j}, respectively [6].

Fig. 2. Local cycle time lct_i and global cycle time gct.

Control modules ctrl_{i_j} operate as follows. A rising transition of the input signal from the previous control module triggers the behavior of control module ctrl_{i_j}. Through the C-element, it is transferred to q_{i_j} as a rising transition of in_{i_j}. q_{i_j} generates a rising transition of req_{i_j}. After the delay of the delay element sd_{i_j} has elapsed, it is returned to q_{i_j} as a rising transition of ack_{i_j}. q_{i_j} then generates a falling transition of req_{i_j}, which is returned to q_{i_j} as a falling transition of ack_{i_j}. Finally, control is passed to the next control module through a rising transition of out_{i_j}. Pipeline registers or memories in the data-path circuit are controlled by falling transitions of ack_{i_j}.
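The order of transitions described above can be listed with a small sketch. It is only an illustration under strong simplifying assumptions: the function name and the 2.5 ns delay are hypothetical, and every gate and wire delay other than the rising delay through sd_{i_j} is treated as zero.

```python
from typing import List, Tuple


def control_module_events(stage: int, module: int, sd_delay_ns: float) -> List[Tuple[float, str]]:
    """List the transitions of one activation of ctrl_{i_j} in order.

    Assumption: only the delay element sd contributes delay; all other
    delays are ignored so that the ordering is easy to read.
    """
    tag = f"{stage}_{module}"
    t = 0.0
    events = [
        (t, f"in_{tag} rises (input passed through C-element c_{tag})"),
        (t, f"req_{tag} rises (generated by q_{tag})"),
    ]
    t += sd_delay_ns  # the request travels through delay element sd
    events.append((t, f"ack_{tag} rises"))
    events.append((t, f"req_{tag} falls"))
    events.append((t, f"ack_{tag} falls (registers/memories are written)"))
    events.append((t, f"out_{tag} rises (control passed to the next module)"))
    return events


# Hypothetical example: control module 0 of pipeline stage 1.
for time_ns, event in control_module_events(stage=1, module=0, sd_delay_ns=2.5):
    print(f"{time_ns:4.1f} ns  {event}")
```

The sketch makes one point visible: the write to registers or memories happens on the falling edge of ack, so it cannot occur before the delay of sd has elapsed.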

We define the global cycle time gct and the local cycle time lct_i for pipeline stage i. The global cycle time gct is the maximum of the local cycle times lct_i over all pipeline stages. The local cycle time lct_i is defined as the maximum path delay from sd_{i−1_1} to the clock pins of the registers controlled by ack_{i_1} through ctrl_{i_0} and ctrl_{i_1}. Fig. 2 represents the paths that determine lct_i and gct.
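Written as formulas, the two definitions read as follows. The symbols P_i (the set of paths from sd_{i−1_1} through ctrl_{i_0} and ctrl_{i_1} to the clock pins of the registers controlled by ack_{i_1}) and d(p) (the delay of path p) are notation introduced here for illustration only; m is the number of pipeline stages, as above.

```latex
\begin{align}
  lct_i &= \max_{p \in P_i} d(p), \\
  gct   &= \max_{0 \le i \le m-1} lct_i .
\end{align}
```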

To guarantee the correct timing of operations, this circuit model must satisfy four types of timing constraints: setup constraints, hold constraints, control module initialization constraints, and simultaneous writing constraints [6].

Simultaneous writing constraints guarantee that data are written into all registers and memories at the same time, which avoids setup and hold violations. We represent the time for register and/or memory writing as the reference time ref. Control-path delays are adjusted to ref by delay elements wd_{i_j,k}. However, it is difficult to match ref exactly by adding or removing cells in wd_{i_j,k}. Therefore, we assign both positive and negative margins to ref, and the write timing is tolerated anywhere between the negative and positive margins. Figure 3 shows three control-path delays for a pipeline stage i with ref and its positive and negative margins. Note that the maximum value of the control-path delays corresponds to lct_i. The next instruction can be fetched after ref has elapsed. Therefore, ref determines the performance of the designed asynchronous processors.

Fig. 3. Reference time ref and its positive and negative margins for simultaneous writing constraint.
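The tolerance window around ref can be expressed as a simple check. The following is a minimal sketch, not the authors' verification script: the function name, path names, and delay values are hypothetical, and the margins are passed in as plain numbers.

```python
from typing import Dict


def check_simultaneous_writing(delays_ns: Dict[str, float], ref_ns: float,
                               pos_margin_ns: float, neg_margin_ns: float) -> Dict[str, str]:
    """Classify each control-path delay against the window around ref."""
    lower = ref_ns - neg_margin_ns
    upper = ref_ns + pos_margin_ns
    status = {}
    for path, delay in delays_ns.items():
        if delay < lower:
            status[path] = "early"  # write happens before the window
        elif delay > upper:
            status[path] = "late"   # write happens after the window
        else:
            status[path] = "ok"     # write is tolerated
    return status


# Hypothetical control-path delays for one pipeline stage.
delays = {"path_a": 11.0, "path_b": 12.3, "path_c": 13.1}
print(check_simultaneous_writing(delays, ref_ns=12.0, pos_margin_ns=0.8, neg_margin_ns=0.8))
```

A path reported as early or late is one whose delay element wd_{i_j,k} would need cells added or removed before re-synthesis.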

III. Proposed Method

A. Overview

The proposed method generates placement constraints in two steps to improve the performance of FPGA-based asynchronous processors. In the first step, we generate placement constraints for the resources used in FPGA-based asynchronous processors through floorplanning to reduce path delays among resources. In the second step, we generate placement constraints for the logic of control modules and delay elements, considering the structure of the target FPGA, to reduce delays inside control modules and delay elements.

Placement constraints are useful not only for performance optimization but also for stabilizing delay variation. In bundled-data implementation, delay adjustments are required to satisfy timing constraints such as setup constraints. As a result, delays change whenever re-synthesis is performed after adding or removing cells used in delay elements; placement constraints limit this variation.

In this paper, we generate placement constraints assuming that the target FPGAs are Intel FPGAs. However, this assumption does not restrict the applicability of the proposed method. We think that the proposed method is also applicable to other FPGAs if their structure can be analyzed.

B. Generation of Placement Constraints for Resources

We generate placement constraints for the resources used in FPGA-based asynchronous processors to reduce path delays among resources. The procedure is as follows.

1. We obtain the number of required logic elements (LEs) for each resource through the initial synthesis. For delay elements, we decide an upper bound on the number of LEs. We add the upper bounds for sd_{i_j}, bd_{i_j}, and cd_{i_j} to the required LEs for control modules ctrl_{i_j}, and add the upper bounds for hd_k and wd_{i_j,k} to the required LEs for registers reg_k.

2. We decide the placement region size for each resource by adding a margin to the required LEs.

3. We decide the X and Y coordinates of the lower-left and upper-right corners of each region on the target device by floorplanning.

Fig. 4. Example of floorplans: (a) floorplan based on the first approach and (b) floorplan based on the second approach.
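Steps 1 and 2 of the procedure above amount to a small calculation, sketched below. Several things are assumptions made only for this illustration: the margin is expressed as a percentage, one grid location is taken to hold 16 LEs, the region height is chosen by the designer, and all numbers and names are hypothetical.

```python
from typing import List, Tuple

LES_PER_LOCATION = 16  # assumption: one grid location holds one block of 16 LEs


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


def region_size(required_les: int, delay_upper_bounds: List[int],
                margin_percent: int, height: int) -> Tuple[int, int]:
    """Return (width, height) of a placement region in grid locations."""
    # Step 1: LEs from the initial synthesis plus upper bounds for delay elements.
    total = required_les + sum(delay_upper_bounds)
    # Step 2: add a margin, then convert LEs to grid locations.
    with_margin = ceil_div(total * (100 + margin_percent), 100)
    locations = ceil_div(with_margin, LES_PER_LOCATION)
    return ceil_div(locations, height), height


# Hypothetical control module: 70 LEs plus upper bounds for sd, bd, and cd.
print(region_size(70, [10, 10, 5], margin_percent=20, height=4))  # (2, 4)
```

Step 3, deciding where each region goes, is the actual floorplanning and is not covered by this sketch.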

For floorplanning, we consider two approaches. The first approach places resources according to the pipeline structure. The second approach moves the control modules of the first approach to the center of the placement. The aim of the second approach is to balance the control-path delays of the pipeline stages. It reduces gct while minimizing the differences among lct_i.

Figure 4 represents floorplan examples. Fig. 4(a) and (b) represent floorplans obtained by the first approach and the second approach, respectively. Control modules are placed on the left in the first approach and in the center in the second approach.

Figure 5 represents a part of the placement constraints generated by the first approach. In this example, we assume the Intel Cyclone IV E FPGA (EP4CE115F29C7) as the target device. LogicLock (LL) represents the placement constraint for each resource. For example, the placement region of the resource IMEM is the rectangle from (51, 64) to (52, 67) on the device because the coordinates of the lower-left corner are (51, 64), the width is 2, and the height is 4.
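The IMEM example follows from treating both corners as inclusive grid coordinates. A minimal sketch of that arithmetic is given below, together with an overlap test that a floorplanning script might use as a sanity check. The function names are hypothetical, and the second and third regions are made up for illustration.

```python
from typing import Tuple

Corner = Tuple[int, int]
Region = Tuple[Corner, Corner]  # (lower-left, upper-right), both inclusive


def region_corners(x: int, y: int, width: int, height: int) -> Region:
    """Corners of a region given its lower-left origin, width, and height."""
    return (x, y), (x + width - 1, y + height - 1)


def overlaps(a: Region, b: Region) -> bool:
    """True if two regions share at least one grid location."""
    (ax0, ay0), (ax1, ay1) = a
    (bx0, by0), (bx1, by1) = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1


imem = region_corners(51, 64, 2, 4)
print(imem)  # ((51, 64), (52, 67)), as in the IMEM example

neighbor = region_corners(53, 64, 2, 4)     # hypothetical region to the right
clashing = region_corners(52, 66, 3, 3)     # hypothetical region that collides
print(overlaps(imem, neighbor), overlaps(imem, clashing))  # False True
```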

C. Generation of Placement Constraints for the Logic inside Control Modules and Delay Elements

We generate placement constraints for the logic used in control modules and delay elements to reduce the delays inside them.

Before generating placement constraints, we need to know the internal structure of the target FPGA. Figure 6 represents the internal structure of the Intel Cyclone IV E FPGA (EP4CE115F29C7). In this FPGA, one logic element (LE) contains one look-up table and one D flip-flop. One Logic Array Block (LAB) consists of 16 LEs. Routing resources are classified into local interconnects and global interconnects [10]. Local interconnects connect LEs inside a LAB and connect LABs to global interconnects. Global interconnects connect LABs in the target FPGA. Usually, the wire delays of local interconnects are shorter than those of global interconnects.

Fig. 5. Placement constraints for a resource used in a processor.

Fig. 6. Structure of Intel Cyclone IV E FPGA.

We generate placement constraints for the logic used in control modules and delay elements so that it is mapped to the LEs of LABs in the order of signal propagation in control modules ctrl_{i_j}. By mapping adjacent logic to adjacent LEs in LABs, we reduce wire delays inside control modules and delay elements.

Figure 7 represents the logic structure of control module ctrl_{i_j} and its placement in the target FPGA. Figure 8 represents the generated placement constraints. By using the set_location_assignment command, we constrain not only the X and Y coordinates of a LAB but also the position of the LE (represented by N∗) within the LAB.
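A script that emits such constraints could look like the sketch below. It is not the authors' generator. The cell names are hypothetical, and the location string format (LCCOMB_X…_Y…_N…, with even N values for combinational cells) is an assumption about how LE positions are named on this device family; it should be checked against the Quartus Prime documentation before use.

```python
from typing import List

LES_PER_LAB = 16  # one LAB of the target FPGA consists of 16 LEs


def location_assignments(cells: List[str], lab_x: int, lab_y: int) -> List[str]:
    """Emit one set_location_assignment line per cell, in propagation order.

    Cells are listed in the order in which the signal passes through them,
    so consecutive cells land on consecutive LEs of the same LAB.
    """
    if len(cells) > LES_PER_LAB:
        raise ValueError("more cells than LEs in one LAB")
    lines = []
    for index, cell in enumerate(cells):
        n = 2 * index  # assumption: combinational cells use even N values
        location = f"LCCOMB_X{lab_x}_Y{lab_y}_N{n}"
        lines.append(f'set_location_assignment {location} -to "{cell}"')
    return lines


# Hypothetical cells of one control module, in signal propagation order.
for line in location_assignments(["ctrl1_0|c", "ctrl1_0|q", "ctrl1_0|sd_0"], 51, 64):
    print(line)
```

Keeping the list in propagation order is the design choice that matters: it is what places adjacent logic on adjacent LEs, so signals stay on local interconnects where possible.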

IV. Experiments

In the experiments, we evaluate the performance of FPGA-based asynchronous processors designed using the generated placement constraints. We use the reference time for simultaneous writing constraints, ref, as the performance metric because the interval between instruction fetches depends on ref. In addition, we evaluate the execution time, area, power consumption, and energy consumption of the designed FPGA-based asynchronous processors.

We use the MIPS processor described in [11]. This MIPS processor consists of five pipeline stages: instruction fetch, instruction decode, execution, memory access, and write back. It supports nine instructions: load word, store word, jump, branch on equal, addition, subtraction, logical or, logical and, and set on less than.

For the FPGA, we use the Intel Cyclone IV E (EP4CE115F29C7). For synthesis and static timing analysis (STA), we use Quartus Prime 16.1 and TimeQuest Timing Analyzer 16.1. We prepare scripts and spreadsheets for the generation of placement constraints by the proposed method, the generation of maximum delay constraints, and the verification of timing constraints.

Fig. 7. Mapping of control module ctrl_{i_j} to the target FPGA.

Fig. 8. Placement constraints for the logic in control modules and delay elements.

To confirm the effect of the proposed method, we compare the following six MIPS processors.

• Synchronous MIPS processor without placement constraints (Sync)

• Asynchronous MIPS processor without placement constraints (Async)

• Asynchronous MIPS processor with placement constraints generated by the first approach of floorplanning only (Async_{fp1})

• Asynchronous MIPS processor with placement constraints generated by the second approach of floorplanning only (Async_{fp2})

• Asynchronous MIPS processor with placement constraints for control modules and delay elements only (Async_{ctr})

• Asynchronous MIPS processor with two-step placement constraints (Async_{fp2_ctr})

TABLE I
ref and the number of delay adjustments for the designed asynchronous MIPS processors.

name              ref [ns]   adjust
Async             15.0       3
Async_{fp1}       14.8       1
Async_{fp2}       14.2       3
Async_{ctr}       14.0       3
Async_{fp2_ctr}   14.4       2

Note that the floorplanning of Async_{fp2_ctr} is based on the second approach of floorplanning.

We extend the design method proposed in [6] to design the asynchronous MIPS processors. Initially, we model and synthesize an asynchronous MIPS processor based on the synchronous MIPS processor. We decide the number of required logic elements (LEs) for resources from the initial synthesis. Using the number of required LEs, we generate placement constraints for resources through floorplanning and/or for the logic used in control modules and delay elements. In addition to placement constraints, we generate maximum delay constraints for all paths from the path delays obtained by static timing analysis (STA) of the initial synthesis result and a given cycle time constraint. Synthesis, maximum delay constraint generation, and delay adjustment for the asynchronous MIPS processors are repeated until all timing constraints for bundled-data implementation are satisfied.

The constraint values and parameters used in this experiment are as follows. The cycle time constraint ct is the smallest value with which Sync is synthesized without any timing violation; ct is 12 ns. In asynchronous MIPS processors, simultaneous writing constraints for registers and memories must be satisfied. Initially, we set ref to 12 ns. If the global cycle time gct obtained from STA exceeds ref, we change ref to gct. We set the margin for ref of the simultaneous writing constraints, the margin for data-path delays, and the margin for control-path delays to ±0.8 ns, 0.6 ns, and 0.6 ns, respectively.

In Table I, ref and adjust represent ref of the designed asynchronous MIPS processors and the number of delay adjustments performed until all timing constraints are satisfied. ref of Async_{fp2_ctr} is better than those of Async and Async_{fp1}, while it is worse than those of Async_{fp2} and Async_{ctr}. On the other hand, adjust of Async_{fp2_ctr} is smaller than those of Async_{fp2} and Async_{ctr}.

Although Async_{fp2_ctr} is the combination of Async_{fp2} and Async_{ctr}, ref of Async_{fp2_ctr} is worse than those of Async_{fp2} and Async_{ctr}. This is because Quartus Prime automatically decides which wires and which LE ports are used, and these differ among the designs. It means that Async_{fp2_ctr} is not always worse than Async_{fp2} and Async_{ctr}. In fact, the differences in ref among Async_{fp2_ctr}, Async_{fp2}, and Async_{ctr} are very small (at most 0.4 ns in Table I).

Fig. 9. Change of ref from the beginning to the completion of delay adjustments.

Figure 9 represents the change of ref from the beginning to the completion of delay adjustments. Before represents ref set to the clock cycle time of Sync (i.e., 12 ns). Initial represents ref changed to gct after STA for the initial synthesis completes. 1st re-syn, 2nd re-syn, and 3rd re-syn represent ref after STA for the 1st, 2nd, and 3rd re-synthesis following delay adjustments. Compared to Async, which does not use placement constraints, each kind of placement constraint generation stabilizes the delay variation during delay adjustment and re-synthesis. This is useful for predicting the performance of FPGA-based asynchronous processors after the initial synthesis. It is also useful for reducing the number of delay adjustments (as in Async_{fp1} and Async_{fp2_ctr}).

From the above observations, we conclude that Async_{fp2_ctr} can improve performance with fewer delay adjustments.

On the other hand, in this experiment, we did not obtain asynchronous MIPS processors whose performance is better than or equal to that of Sync (i.e., 12 ns). There are two problems to be solved. The first is to minimize data-path delays. Unlike in Sync, where data-path delays are determined by register-to-register delays, data-path delays in bundled-data implementation are the sum of the delays from sd_{i_j} to registers and the register-to-register delays. The second is to minimize the control-path delays from the primary input pins to the first control module. We are going to consider them in the floorplanning of asynchronous processors.

Fig. 10. Experimental results: (a) execution time, (b) area, (c) dynamic power consumption, and (d) energy consumption.

Figure 10(a), (b), (c), and (d) represent the execution time, area, dynamic power consumption, and energy consumption of the designed MIPS processors, respectively. The execution time is the simulation time obtained with ModelSim-Intel Starter Edition 10.5b when a 3x3 matrix multiplication is performed. The area is the number of used logic elements. The dynamic power consumption is obtained by using the PowerPlay Power Analyzer in Quartus Prime with the simulation result. The energy consumption is the product of the dynamic power consumption and the execution time.
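The energy metric is a plain product, so the only thing to get right is the units. The sketch below assumes power in milliwatts and time in microseconds, which gives energy in nanojoules; the numbers are hypothetical and are not taken from Figure 10.

```python
def energy_nj(dynamic_power_mw: float, execution_time_us: float) -> float:
    """Energy in nanojoules: 1 mW x 1 us = 1 nJ."""
    return dynamic_power_mw * execution_time_us


# Hypothetical values: equal dynamic power, different execution times.
print(energy_nj(50.0, 4.0))  # 200.0
print(energy_nj(50.0, 5.0))  # 250.0
```

With nearly equal dynamic power, a design with a longer execution time consumes proportionally more energy, which is why execution time and power are reported together.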

The execution time of Async_{fp2_ctr} is better than those of Async and Async_{fp1}, while it is worse than those of Async_{fp2} and Async_{ctr}. This follows from the results for ref: because the next instruction is fetched after ref has elapsed, the execution time becomes shorter as ref becomes smaller.
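Since ref sets the fetch interval, it may help to restate how ref is chosen in this experiment: it starts at the cycle time of Sync and is raised to gct whenever STA reports a larger value. The sketch below applies that rule after every STA run, which is an assumption made for illustration; the gct values are hypothetical and do not reproduce Figure 9.

```python
from typing import List


def update_ref(ref_ns: float, gct_ns: float) -> float:
    """Raise ref to gct if the global cycle time from STA exceeds it."""
    return gct_ns if gct_ns > ref_ns else ref_ns


def ref_history(ct_ns: float, gct_per_sta_ns: List[float]) -> List[float]:
    """ref before synthesis, followed by ref after each STA run."""
    history = [ct_ns]
    for gct in gct_per_sta_ns:
        history.append(update_ref(history[-1], gct))
    return history


# Hypothetical gct values for the initial synthesis and two re-syntheses.
print(ref_history(12.0, [14.6, 14.2, 14.9]))  # [12.0, 14.6, 14.6, 14.9]
```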

Compared with Sync, the dynamic power consumption of the designed asynchronous MIPS processors is almost equal. One of the reasons is the increase in area. The area increase mainly comes from the restricted optimization of data-path resources by Quartus Prime, whereas the area increase due to control modules and delay elements is just 10%. We are going to investigate how to solve this problem by checking the Verilog HDL models and the design constraints used.

V. Conclusions

In this paper, we proposed a performance optimization method for FPGA-based asynchronous processors using placement constraints. The proposed method generates placement constraints in two steps. The first step is based on floorplanning to reduce path delays among resources. We considered two approaches for floorplanning, and the second approach resulted in better performance. The second step reduces control-path delays inside control modules and delay elements by considering the structure of the FPGA. In the experiments, we designed asynchronous MIPS processors using the proposed method. The experimental results showed that the proposed method optimizes performance with fewer delay adjustments.

As future work, we are going to propose a modeling method, a constraint generation method, and a floorplanning method to generate asynchronous processors that outperform their synchronous counterparts.

Acknowledgement

This work is partially supported by a Grant-in-Aid for Scientific Research from the Japan Society for the Promotion of Science (#15K00080).

References

[1] N. Minas et al., "FPGA Implementation of an Asynchronous Processor with Both Online and Offline Testing Capabilities", Proc. ASYNC, pp. 128-137, 2008.

[2] C-J. Chen, W-M. Cheng, R-F. Tsai, H-Y. Tsai, and T-C. Wang, "A Pipelined Asynchronous 8051 Soft-core Implemented with Balsa", Proc. APCCAS, pp. 976-979, 2008.

[3] D. Edwards and A. Bardsley, "Balsa: An asynchronous hardware synthesis language", The Computer Journal, 45(1): pp. 12-18, 2002.

[4] J. H. Lee, Y. H. Kim, and K. R. Cho, "Designing an Asynchronous FPGA Processor for Low-Power Sensor Networks", Proc. ISSCS, pp. 1-6, 2009.

[5] M. Herrera and F. Viveros, "Asynchronous 8-bit processor mapped into an FPGA device", 2014 IEEE Colombian Conference on Communications and Computing (COLCOM), Bogota, pp. 1-7, 2014.

[6] J. Furushima, M. Nakajima, and H. Saito, "Design of an Asynchronous Processor with Bundled-data Implementation on a Commercial Field Programmable Gate Array", Informatica, vol. 40, no. 4, pp. 399-408, 2016.

[7] H. Saito, T. Yoneda, and T. Nanya, "A Floorplan Method of Asynchronous Circuits with Bundled-data Implementation for FPGAs", Proc. ISCAS, pp. 925-928, 2010.

[8] F. U. Rosenberger, C. E. Molnar, T. J. Chaney, and T. P. Fang, "Q-Modules: Internally Clocked Delay Insensitive Modules", IEEE TC, vol. C-37, no. 9, pp. 1005-1018, 1988.

[9] D. E. Muller and W. S. Bartky, "A theory of asynchronous circuits", Proc. International Symposium on the Theory of Switching, pp. 204-243, 1959.

[10] Intel, "Cyclone IV Device Handbook", https://www.altera.co.jp/content/dam/altera-www/global/en_US/pdfs/literature/hb/cyclone-iv/cyclone4-handbook.pdf

[11] D. A. Patterson and J. L. Hennessy, "Computer Organization and Design, Fifth Edition: The Hardware/Software Interface", Morgan Kaufmann, 2006.
