

Microprocessors and Microsystems 37 (2013) 432–445

Contents lists available at SciVerse ScienceDirect

Microprocessors and Microsystems

journal homepage: www.elsevier.com/locate/micpro

Design and implementation of reconfigurable FIFOs for Voltage/Frequency Island-based Networks-on-Chip

Amir-Mohammad Rahmani a,b,⇑, Pasi Liljeberg a, Juha Plosila a, Hannu Tenhunen a

a Department of Information Technology, University of Turku, Finland
b Turku Centre for Computer Science (TUCS), Finland

Article info

Article history: Available online 16 July 2012

Keywords: Network-on-Chip (NoC); Voltage/Frequency Island (VFI); Globally Asynchronous Locally Synchronous (GALS); Reconfigurable FIFO

0141-9331/$ - see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.micpro.2012.07.003

⇑ Corresponding author at: Department of Information Technology, University of Turku, Finland. Tel.: +358 443462629; fax: +358 23336950.

E-mail address: amirah@utu.fi (A.-M. Rahmani).

Abstract

One of the major design bottlenecks in today’s high-performance VLSI systems is the distribution of a single global clock across a chip due to process variability, power dissipation, and multi-cycle cross-chip signaling. A Network-on-Chip architecture partitioned into several Voltage/Frequency Islands (VFIs) is considered as a promising approach for achieving fine-grain system-level power management. In a VFI-based architecture, a clock is utilized for local data synchronization, while inter-island communication is handled asynchronously. To interface the islands on a chip, operating at different frequencies, a complex bi-synchronous FIFO design is inevitable. However, these FIFOs are not needed if adjacent switches belong to the same clock domain. In this paper, a Reconfigurable Synchronous/Bi-Synchronous (RSBS) FIFO is proposed which can adapt its operation to either synchronous or bi-synchronous mode. Four different scalable and synthesizable designs are presented. In addition, a technique is suggested to show how the FIFO could be utilized in a VFI-based NoC. Moreover, we present a mesochronous adaptation method and propose Reconfigurable Mesochronous/Bi-Synchronous (RMBS) FIFOs. Our extensive experiments reveal that compared to a non-reconfigurable system architecture, the proposed reconfigurable FIFOs can help to achieve up to 17% savings in the average power consumption of NoC switches and 29% improvement in the total average packet latency in the case of an MPEG-4 encoder application.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

The design constraints imposed by on-chip interconnects in the ultra deep sub-micron (UDSM) regime are major concerns in current and future high-performance System-on-Chip (SoC) design. These interconnects are quickly becoming a performance impediment from both communication latency and power perspectives. To address these problems, Network-on-Chip (NoC) has been proposed as a general purpose on-chip interconnection network because of scalability and better throughput [1,2]. In addition, achieving power efficiency has become an increasingly difficult challenge, especially in the presence of increasing die sizes, high clock frequencies and variability driven design issues. Globally Asynchronous Locally Synchronous [3,4] (GALS)-based NoCs implemented using a multiple Voltage Frequency Island (VFI) design style have become an attractive alternative to traditional designs [5]. In fact, VFI-based designs could be used for managing local frequencies or voltages to match prescribed performance limits, while minimizing energy consumption.


Assignment of frequencies and voltages to VFIs can be done by using either offline or online methods [6]. Offline methods can be used when the behavior of an application is very predictable for various input conditions [7]. However, such an approach is not well suited for applications that show large variations in their behavior for different input conditions. For such systems, online methods are more suitable [6,8]. Dynamic Voltage and Frequency Scaling (DVFS) schemes can be used to adapt the system to meet the performance requirements of a dynamically changing workload while minimizing power consumption. On the other hand, in order to benefit from the GALS scheme, communication between islands should be carried out by using mixed-timing (bi-synchronous) FIFOs [9], which accommodate clock frequency discrepancies; however, due to the overhead of these FIFOs in terms of latency, area, and power consumption, the associated design complexity increases.

The contribution of this paper is twofold. By exploiting reconfigurable FIFOs, the latency and power consumption overhead of bi-synchronous FIFOs can be considerably decreased in cases where synchronization is not required. Because a NoC partitioned into VFIs requires a method to embed and take advantage of these FIFOs, we have developed an online controller for the input channels of a NoC switch which can adaptively decide the operating mode of the FIFOs. We further present four design styles for implementing reconfigurable FIFOs, enabling the bi-synchronous FIFOs to work dynamically in either synchronous or bi-synchronous mode.

The rest of the paper is structured as follows. Section 2 describes the related work. The proposed power-aware VFI-based voltage/frequency assignment technique, which highlights the motivation for utilizing the reconfigurable FIFOs, is described in Section 3. The architectures of the proposed Reconfigurable Synchronous/Bi-Synchronous (RSBS) and Reconfigurable Mesochronous/Bi-Synchronous (RMBS) FIFOs are presented in detail in Section 4. Section 5 shows the experimental results and analyzes the impact of our approach on an MPEG-4 case study. Finally, Section 6 summarizes our main contributions.

2. Related work

The distribution of a clock signal in a System-on-Chip (SoC) has become a problem, because of wire length and process variations. Novel approaches such as the globally asynchronous, locally synchronous scheme try to solve this by partitioning the SoC into isolated synchronous islands [3].

GALS-based communication mechanisms have been the subject of extensive prior research [10]. In [9], the authors utilize two-flip-flop synchronizers to implement a mixed-timing FIFO in order to interface systems on a chip working at different frequencies. The FIFOs proposed in [11,12] benefit from Gray and Johnson coding for read and write pointers, respectively, while a ring counter has been used for the bi-synchronous FIFOs presented in [13,14].

The voltage island technique [15] makes chip design more efficient by reducing both the static and dynamic power consumption of the system. In the voltage island concept, processing elements can have power characteristics independent of the rest of the design; however, they use the same voltage source. This technique enables designers to optimize the power dissipation of each domain according to system requirements. As the design complexity of such systems increases with the number of islands, an elaborate design methodology is needed to provide efficient voltage island partitioning, voltage level assignment and physical-level floorplanning. In [16], Hu et al. addressed these issues by proposing an annealing-based algorithm.

The concept of voltage islands has been expanded to offer further power savings by providing an independent frequency level for each island (i.e. Voltage Frequency Island) thanks to the GALS technique. There have been several proposals to combine the benefits of a GALS-based NoC interconnect mechanism with VFI design methods [17,18]. The study of voltage and frequency assignment has become an important aspect of VFI-based systems. For instance, the authors of [19] present design methodologies for partitioning an NoC architecture into multiple VFIs and assigning frequency, supply voltage, and threshold voltage levels to each VFI according to given performance constraints at design time. Jang et al. [20] enhanced this methodology by proposing a systematic VFI-aware energy optimization framework able to consider VFI-aware partitioning, VFI-aware mapping, and VFI-aware routing together.

In recent years, heuristic task mapping techniques have generated interest in VFI-based systems by taking into account other design constraints such as area and temperature characteristics. In [21], in order to minimize energy consumption while satisfying the performance constraints, a task mapping technique for heterogeneous NoC-based systems operating at multiple voltage levels was presented. In that work, a voltage assignment technique was also proposed which has a one-to-one correspondence with minimum weight graph coloring. In [22], Arjomand et al. proposed a thermal-aware voltage-frequency assignment approach for three-dimensional NoCs. The methodology also utilizes an offline technique to map application tasks to processing elements of a 3D NoC platform, considering the performance requirements of the tasks and thermal conductivity to the heat sink.

Recently, there has been wide interest in proposing hardware-based approaches to dynamically change the frequencies and potentially voltages of a VFI system driven by a dynamic workload [6,8]. For instance, Ogras et al. [23] proposed a voltage-frequency control system to monitor and manage network workload dynamically to handle workload variability issues. In other recent work, Bogdan et al. [24] proposed a fractal-based control approach to manage the power consumption of VFI-based systems. This technique enables the system to capture the fractality and non-stationarity characteristics of the workload.

In all the discussed approaches, bi-synchronous FIFOs play a critical role in the design and management of VFI-based systems. Due to their high latency and power overhead, there is a substantial limitation on freely exploiting bi-synchronous FIFOs. These restrictions, which could be mitigated considerably by utilizing reconfigurable FIFOs, are described in detail using two scenarios hereinafter. In this work, which is a major extension of our recent work published in [25], we propose reconfigurable FIFOs able to work in two distinct modes, which accordingly offer much more flexibility to both dynamic and static voltage/frequency assignment techniques at the cost of negligible area overhead.

3. Power-aware VFI-based frequency voltage assignment using reconfigurable FIFOs

As mentioned in the previous section, partitioning a NoC into several VFIs for assigning different voltages/frequencies is of increasing concern in today’s technologies. To this end, a system comprised of a number of synchronous cores, IPs or processing elements (PEs) that can be assigned to VFIs is considered. A VFI can consist of a single PE or, depending on the physical or design considerations, may contain a group of PEs. Each VFI is assumed to have a voltage level above a certain value Vmin and, since the architecture is globally asynchronous, locally synchronous, each module or core is assumed to be clocked by a local ring oscillator or a central clock generator controlled by the variable intra-island supply voltage [4,26]. In such systems, the assignment of voltage/frequency to each island can be classified into Static and Dynamic Voltage/Frequency Assignment techniques. In the rest of this section, we will describe the general concept of these techniques and discuss their main shortcomings.

3.1. Dynamic voltage frequency assignment

In such systems, each individual processing element is usually a locally synchronous module operating with its own clock and either being a single VFI or forming a VFI with another synchronous module. This enables dynamic voltage/frequency scaling in each synchronous module using a DC–DC voltage regulator and a central or local variable-delay ring oscillator maintaining the clock of the VFI. There is usually a discrete set of frequency and voltage levels (typically 2–6 levels) assigned by methods such as forecasting. For instance, in prediction-based algorithms, based on some training data such as the number of cycles required by the PE to process past samples, the workload requirement of the PE for the next sample may be predicted and the local voltage/frequency can be scaled accordingly.
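The prediction-based scheme described above can be illustrated with a minimal sketch. It is not the algorithm of any cited work: the moving-average predictor, the per-sample deadline, the function names and the four frequency levels are all assumptions chosen only to show how a predicted cycle count could be mapped to one of a discrete set of levels.

```python
def predict_cycles(history, window=4):
    """Predict the cycle count of the next sample as the mean of the
    last `window` observed samples (assumed predictor, for illustration)."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def select_level(predicted_cycles, deadline_s, levels_hz):
    """Pick the lowest frequency level that still meets the deadline.
    Falls back to the highest level if none is fast enough."""
    for f in sorted(levels_hz):
        if predicted_cycles / f <= deadline_s:
            return f
    return max(levels_hz)


# Hypothetical discrete level set (the text mentions 2-6 levels).
LEVELS_HZ = [250e6, 500e6, 750e6, 1000e6]

# Hypothetical training data: cycles the PE needed for past samples.
history = [400_000, 420_000, 380_000, 400_000]

predicted = predict_cycles(history)
level = select_level(predicted, deadline_s=1e-3, levels_hz=LEVELS_HZ)
```

With these assumed numbers, 250 MHz would need 1.6 ms for the predicted workload and misses the 1 ms deadline, so the next level up is selected. The supply voltage would then be scaled to whatever value supports that frequency.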

For example, Fig. 1 shows frequency levels in a sample 4 × 4 mesh-based NoC system for two subsequent periods. Such a scheme would be very useful in applications where the workload is not overwhelming. Note that the prediction generally is performed at the start of a predefined timing window having a length of w packets.

Fig. 1. Frequency level of the nodes in a sample 2D Mesh network with 3 VFIs for two consecutive periods.

Fig. 2. Input channel module architecture with support of Reconfigurable Syn/Bi-Syn FIFOs. [Blocks: IFC, IC, IRS, RSBS IB, Mode Selector. Signals: in_data, in_val, in_ack, X_dout, X_req{E,W,S,N,L}, X_gnt[3:0], X_rd[3:0], X_rok, wr, wok, rd, rok, clk_write, clk_read, Syn/Bi-Syn mode; in_data[n+1] is the header flag.]

For the sake of adaptivity in voltage/frequency selection, all nodes should benefit from bi-synchronous FIFOs. Let us consider a frequent case in which adjacent cores in a NoC work at the same frequency level (e.g., they belong to the same VFI in the current timing window). In this situation, although the read and write clock signals of their FIFOs have equal frequencies, the pointers are still passed through synchronizer blocks for asserting the full and empty signals. However, these FIFOs can be informed about their equal read and write clock frequencies. A reconfigurable FIFO capable of operating in both synchronous and bi-synchronous modes can cope with this by bypassing and switching off the unused components (e.g., synchronizers, code converters), resulting in considerable improvement in terms of latency, throughput, and power consumption.

In order to highlight the importance of reconfigurable FIFOs, we have embedded a simple hardware block called Mode Selector in the input channel of a RASoC-based [27] NoC switch. As can be seen from Fig. 2, this module is responsible for recognizing the equality of the write (provided by the output channel of the adjacent switch) and read clock frequencies and directing the buffer to operate in synchronous or bi-synchronous mode.

The input channel module shown in Fig. 2 consists of five different units: IFC (Input Flow Controller), IC (Input Controller), RSBS IB (Reconfigurable Synchronous/Bi-Synchronous Input Buffer), IRS (Input Read Switch), and Mode Selector (which is added to the basic architecture of the input channel module to support the RSBS FIFO). The RSBS IB block is a dual-mode FIFO buffer, while the IC block of each input channel performs the routing function, its IRS block receives the x_rd and x_gnt signals and triggers the rd signal of the RSBS IB block, and the IFC block implements the logic that performs the translation between the handshake and the FIFO flow control protocol. Each channel includes n bits for data and two bits for packet framing: begin-of-packet ((n + 2)th bit) and end-of-packet ((n + 1)th bit). The IFC, IC, and IRS modules are described in detail in [27].

To clarify the operation of this system, let us consider a dynamic voltage frequency assignment (DVFA) unit which can be implemented either locally or globally. Based on its decisions at the start of each timing window H (which may have a length of 1 to j packets), this unit determines the frequency level of each switch for the next period. In the current timing window, each switch puts an m-bit number, where $m = \lceil \log_2(\text{number of frequency levels}) \rceil$, corresponding to its frequency level in the header of each packet. The Mode Selector block detects through the in_val signal that a flit is arriving, and recognizes from the (n + 2)th bit of the in_data signal that it is a header flit. It then compares the respective m bits of the header flit, representing the frequency level of the adjacent switch, with the frequency level of the enclosing router. If the frequency levels are the same (different), it will inform the RSBS IB block to work in the synchronous (bi-synchronous) mode. For each timing window, there is a slight switching penalty to change the FIFO mode; however, the overall network latency improvement achieved in the synchronous mode considerably outweighs the switching penalty. In addition, since the Mode Selector and IFC blocks work in parallel, the embedded block does not affect the router critical-path delay.
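The Mode Selector behaviour described above can be sketched as a small behavioural model. The text does not say where the m level bits sit inside the header flit, so placing them in the least significant bits is an assumption, as are the class and method names and the choice of bi-synchronous mode as the power-up default.

```python
def level_bits(num_levels):
    """m = ceil(log2(number of frequency levels)), computed exactly on integers."""
    return max(1, (num_levels - 1).bit_length())


class ModeSelector:
    """Behavioural sketch of the Mode Selector (names are hypothetical)."""

    def __init__(self, n_data_bits, num_levels, own_level):
        self.n = n_data_bits
        self.m = level_bits(num_levels)
        self.own_level = own_level
        # Assumption: start in bi-synchronous mode, which is safe for any clocks.
        self.synchronous = False

    def on_flit(self, in_val, in_data):
        """Update and return the mode; only a valid header flit can change it."""
        if not in_val:
            return self.synchronous
        # The (n + 2)th bit, i.e. bit index n + 1 counting from 0, flags a header.
        is_header = (in_data >> (self.n + 1)) & 1
        if is_header:
            # Assumption: the sender's level occupies the m lowest data bits.
            sender_level = in_data & ((1 << self.m) - 1)
            self.synchronous = (sender_level == self.own_level)
        return self.synchronous


N = 32                       # hypothetical data width
sel = ModeSelector(n_data_bits=N, num_levels=4, own_level=2)
HEADER = 1 << (N + 1)        # begin-of-packet flag

same = sel.on_flit(1, HEADER | 2)      # neighbour at level 2: synchronous
payload = sel.on_flit(1, 3)            # payload flit: mode unchanged
different = sel.on_flit(1, HEADER | 1) # neighbour at level 1: bi-synchronous
```

Because the mode only changes on header flits, the selector in this form reacts once per packet, which matches the per-window switching penalty mentioned in the text.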

Notice that the architecture shown in Fig. 2 is an uncomplicated example of utilizing RSBS FIFOs. It could be considered a novel technique which can complement any switch architecture. For instance, in order to trigger the Mode Selector block even while the packet payload is being sent, instead of occupying m header bits, m extra wires can be added between each pair of adjacent switches in each direction. Moreover, if there is a central DVFA unit, it can be used for assigning a frequency to each island as well as selecting the operation mode of the FIFOs.


Fig. 3. Reconfigurable Syn/Bi-Syn FIFO style #1 architecture.


3.2. Static voltage frequency assignment

In the static voltage frequency assignment techniques (SVFA), the voltage and frequency assignment process is done for each island at design time. Then the VFIs are created, after which bi-synchronous and synchronous FIFOs are respectively employed for inter-island and intra-island communications. According to the recent work in this field [23,7,8], a VFI-based NoC is generally partitioned into 2–4 islands because if a larger number is used, the overhead of the bi-synchronous FIFOs will diminish the power savings gained by the VFI architecture.

Designs using SVFA techniques are advantageous in the case of a system where an oracle has pre-existing knowledge of the number of run-time cycles used in each PE for processing each sample of the application under consideration. Moreover, since all of the SVFA stages are done at design time and for a specific application, the resulting system is not practical for other applications. Thus, general-purpose multicore systems such as Multiprocessor Systems-on-Chip (MPSoCs) cannot efficiently benefit from SVFA techniques. Our proposed RSBS FIFOs can cope with this issue by providing sufficient reconfigurability for the system.

In the next section, we present the architecture of the proposed RSBS FIFO. It can be seen how this simple technique can significantly optimize the overall NoC power consumption as well as latency and throughput.

4. Reconfigurable synchronous/bi-synchronous FIFO

In this section, we first present three reconfigurable FIFO design styles (#1, #2, #3) inspired by [13,14,11], respectively, and discuss their utilization based on the benefits and drawbacks of each design style. These FIFOs are frequently used in recently proposed prominent work pertaining to GALS- and especially VFI-based systems. However, to support different read and write clock frequencies, the previously proposed FIFOs impose considerable performance and area overhead on a NoC switch. To this end, we improve their architectures to provide reconfigurability and present three reconfigurable FIFOs by adding low-cost components. Since the explanation of all the components of each design style is beyond the scope of this paper, only the parts of the baseline FIFOs which should be modified to support reconfigurability are described. More information about the other components and signals can be found in [11,13,14]. For further optimization, we propose a novel RSBS FIFO based on Johnson-encoded pointers (Design Style #4). It should be noted that all the proposed design styles are scalable and synthesizable in synchronous standard cells. Finally, we present a technique for mesochronous adaptation of the RSBS FIFOs in the case where a sender and a receiver have the same clock frequency but different phases.

4.1. RSBS FIFO design style #1

The first FIFO presented in this section is a bi-synchronous FIFO [13] able to interface two synchronous systems with independent clock frequencies and phases, which has been used on the DSPIN NoC [28]. For the sake of metastability [29] avoidance and synchronization of pointers between two independent clock domains, it takes advantage of a token-ring-based bubble encoding technique. Note that for better reliability in metastability avoidance, it uses two tokens per token ring.

As shown in Fig. 3, similar to most bi-synchronous FIFOs, five modules compose the RSBS FIFO architecture: Write pointer, Read pointer, Data buffer, Full detector, and Empty detector. The Write and Read pointers indicate the write and read positions, the Data buffer contains the buffered data of the FIFO, and the Full and Empty detectors signal the status of the FIFO.

The FIFO protocol is synchronous; all input and output signals in the sender and receiver interfaces are synchronous to their respective clock signal, Clk_write or Clk_read. In order to provide reconfigurability, we have added some multiplexers and flip-flops to the Full and Empty detector units, which are used to bypass non-functional components in the synchronous mode. For this purpose, we added the Syn/Bi-Syn_Mode signal indicating the operation mode of the FIFO (synchronous or bi-synchronous). The Write and Read pointers are implemented using the described token rings with the bubble-encoding technique. The position of the tokens determines the position of the pointer; therefore, in this FIFO the length of the Write or Read pointer is equal to the FIFO depth, instead of $\lceil \log_2(\text{FIFO depth}) \rceil$. In design style #1, in order to design an RSBS FIFO based on [13], only the Full and Empty detector components need to be modified; as a result, further description of the other components has been omitted.

The Full detector computes the Full signal using the Write_pointer and Read_pointer contents. As can be seen from Fig. 4, the Full detector performs the logic AND operation between the Write and Read pointers and then collects them with an OR gate, obtaining logic value ‘1’ if the FIFO is Full or quasi-Full, otherwise it is ‘0’. If the FIFO operates in the bi-synchronous mode, this value will be synchronized to the Clk_write clock domain into Full_s; otherwise, the synchronizer can be bypassed.

The implementation of the Empty detector is similar to the Full detector in that both employ the Write and Read pointer contents. As the Empty detector output is correlated to the FIFO throughput, its detection logic has to be optimized, and no anticipation detector should be used. Fig. 4 shows the Empty detector for a four-word FIFO. In the proposed design style, first, if the Syn/Bi-Syn_Mode signal indicates that the FIFO must work in the bi-synchronous mode, the Write_pointer will be synchronized with the read clock into the Synchronized_Write_pointer (SW) using a parallel synchronizer. Otherwise, the Write_pointer will be transferred to the respective logic gates for empty detection without any synchronization (W). The output of this selection is denoted SNSW (Synchronized/Not Synchronized Write pointer). Next, the Read_pointer is recoded into the AND_Read_pointer (AR) using two-input AND gates. The output of AR is a one-hot encoded version of the Read_pointer. Finally, the Empty condition is detected by comparing the SNSW and AR values using three-input AND gates.

Fig. 4. Full and empty detector details of the design style #1. [Legend: Wi: Write pointer i; Ri: Read pointer i; SNSWi: Synchronized/Not Synchronized Write pointer i; ARi: AND Read pointer i.]
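The Empty detection steps above can be sketched in software for a four-word FIFO. The exact gate wiring is only partly recoverable from the text, so the sketch assumes that each AR bit is the AND of two adjacent Read_pointer bits and that each three-input AND gate combines AR bit i with SNSW bits i and i+1 (wrapping around). The synchronizer itself is not modelled; the mode signal only selects which copy of the write pointer is used. All function names are hypothetical.

```python
def rotate(ptr):
    """Advance a token ring by one position (bit i moves to position i + 1)."""
    return [ptr[-1]] + ptr[:-1]


def and_read_pointer(r):
    """Recode a two-token bubble-encoded pointer into one-hot form
    with two-input AND gates (assumed pairing: bit i with bit i + 1)."""
    n = len(r)
    return [r[i] & r[(i + 1) % n] for i in range(n)]


def select_write_pointer(w, synchronized_w, bi_syn_mode):
    """Multiplexer producing SNSW: the synchronized copy in bi-synchronous
    mode, the raw Write_pointer when the synchronizer is bypassed."""
    return synchronized_w if bi_syn_mode else w


def empty(snsw, r):
    """Empty when the one-hot read position lines up with both write tokens
    (one three-input AND per position, ORed together)."""
    n = len(r)
    ar = and_read_pointer(r)
    return any(ar[i] & snsw[i] & snsw[(i + 1) % n] for i in range(n))


# Synchronous mode: the write pointer is used directly.
w = [1, 1, 0, 0]   # two adjacent tokens, as in the bubble encoding
r = [1, 1, 0, 0]
start_empty = empty(select_write_pointer(w, None, bi_syn_mode=False), r)

w = rotate(w)      # one word written
after_write = empty(select_write_pointer(w, None, bi_syn_mode=False), r)

r = rotate(r)      # that word read back
after_read = empty(select_write_pointer(w, None, bi_syn_mode=False), r)
```

In bi-synchronous mode the same `empty` function would be fed the delayed, synchronized copy of the write pointer, which is where the extra latency of that mode comes from.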

In the synchronous mode, the synchronizer blocks used to synchronize the Write_pointer to the Clk_read signal (Fig. 4) and the Full signal to the Clk_write signal can be switched off to prevent unnecessary energy dissipation. Depending on the design characteristics, different power reduction techniques can be utilized to switch off the idle components, such as clock gating [30] and power gating [31]. These techniques have different overheads and implementation requirements. Clock gating is one of the most commonly used techniques for reducing dynamic power consumption; it uses the enable conditions attached to clocked elements to deactivate their clocks. Therefore, the switching power consumption is saved and only static power is incurred. There are different ways to apply the clock gating technique, such as including enable conditions in the RTL code or inserting Integrated Clock Gating (ICG) cells into the design manually or automatically. On the other hand, there are various ways of reducing leakage power consumption. Power gating is a popular technique in which a sleep transistor is inserted between the actual ground rail and the circuit ground. The technique shuts down idle blocks in order to cut off the leakage path.

There are system trade-offs between achieving leakage power savings in the low power mode and the energy dissipated in switching between normal and low power modes. For the RSBS FIFOs, the power reduction technique should be chosen based on factors such as the length of the timing window and the area of the components that need to be switched off. For instance, for short FIFOs with short rows of synchronizers, the clock gating approach is recommended, while power gating can be utilized for large FIFOs used in lengthy timing windows. It should be noted that, as will be shown in the experimental results, RSBS FIFOs offer better latency and throughput characteristics for the system, leading to reduced average packet latency (APL) of networks. The APL improvement itself results in reduced energy consumption per packet. It also decreases the static power consumption because packets are held in FIFOs for a shorter period of time. In fact, even without using any power reduction techniques, RSBS FIFOs are beneficial in terms of energy consumption as well.

Fig. 6. RSBS FIFO put control block.

The first proposed design style is a simple way to implement RSBS FIFOs, and it has negligible area overhead compared to [13] as a result of the maximum employment of existing components. In addition, as a result of bypassing synchronizers in both the full and empty detection stages, it has considerable latency and throughput improvements when it operates in the synchronous mode; hence it is an applicable FIFO architecture for DVFA methods. However, it is not an appropriate option for SVFA techniques, because it uses ring counters for the write and read pointers and, compared to typical synchronous FIFOs, it requires p Flip-Flops instead of $\lceil \log_2 p \rceil$, where p is the buffer depth. As a result, it consumes more power and is not reasonable to use with all NoC switches. In addition, to differentiate between a full and an empty FIFO, this FIFO can only be filled up to the second-to-last position, which means that a p-stage FIFO can only hold p − 1 items.
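The pointer cost and capacity trade-off described above can be illustrated with a short calculation. This is a minimal sketch in Python; the function names are hypothetical and only restate the counts given in the text (p flip-flops for a ring counter versus $\lceil \log_2 p \rceil$ for a binary counter, and p − 1 usable slots for design style #1).

```python
def pointer_flip_flops(depth: int) -> dict:
    """Flip-flops needed per pointer for a FIFO with `depth` slots."""
    if depth < 2:
        raise ValueError("depth must be at least 2")
    return {
        # Ring counter (design style #1): one flip-flop per slot.
        "ring_counter": depth,
        # Binary counter of a typical synchronous FIFO: ceil(log2(depth)),
        # computed with integer arithmetic to avoid floating-point rounding.
        "binary_counter": (depth - 1).bit_length(),
    }


def style1_capacity(depth: int) -> int:
    """Usable items in a style #1 FIFO: one slot stays free to tell full from empty."""
    return depth - 1


for p in (4, 8, 16):
    print(p, pointer_flip_flops(p), style1_capacity(p))
```

For an 8-slot FIFO the sketch reports 8 flip-flops per ring-counter pointer against 3 for a binary pointer, which is the gap that makes the style less attractive when it is instantiated in every switch.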

4.2. RSBS FIFO design style #2

Fig. 5 shows the RSBS FIFO design style #2 based on [14]. In this style, a sender communicates with the FIFO through the put interface and the receiver through the get interface. The FIFO consists of a ring of stages, where each stage has its own put interface, get interface, full-empty control and data storage cells. The put interface cells implement a token ring where a token is passed from one cell to the next each time a data value is written into the FIFO. Likewise, the get interface cells implement a token ring whose token marks the cell for the next data read operation.

Fig. 5. Reconfigurable Syn/Bi-Syn FIFO style #2 architecture. (Each stage comprises a Put Interface Cell, a Get Interface Cell, a Full-Empty Control block and a Data Store; the Sender drives the Put Interface and the Receiver the Get Interface.)
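The token passing described above can be sketched as a one-hot rotation. The following Python model is only illustrative: the function names are hypothetical, and it assumes that the token starts in the first cell and moves by exactly one cell per write.

```python
from typing import List


def pass_token(token: List[int]) -> List[int]:
    """Rotate a one-hot token by one cell (cell i hands the token to cell i + 1)."""
    if sum(token) != 1:
        raise ValueError("the ring must hold exactly one token")
    return [token[-1]] + token[:-1]


def token_sequence(stages: int, writes: int) -> List[str]:
    """Token positions seen over a number of consecutive writes."""
    token = [1] + [0] * (stages - 1)  # assumption: token starts in the first cell
    seen = ["".join(map(str, token))]
    for _ in range(writes):
        token = pass_token(token)
        seen.append("".join(map(str, token)))
    return seen


print(token_sequence(3, 3))  # ['100', '010', '001', '100']
```

The same rotation applies to the get token ring, with reads instead of writes advancing the token.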

In this design style, to construct an RSBS FIFO, we have modified the Get and Put Interface Cells of [14]. In each Put (Get) Interface Cell, there is a block called the Put (Get) Control Block which handles synchronization of the full (empty) signal, which can fall asynchronously with respect to put_clk. Fig. 6 shows the Put Control Block of a Put Interface of an RSBS FIFO. It raises write_enable if the full signal is low, and consists of a synchronizer, a Flip-Flop, an AND gate, and a multiplexer. Depending on the Syn/Bi-Syn_Mode signal, the multiplexer selects either the synchronized or the unsynchronized full signal. The write_enable outputs from all FIFO stages are combined into the space_avail signal for the sender (see Fig. 10 of [14]). Note that, to prevent asynchronous changes of the Syn/Bi-Syn_Mode signal in the put (get) control block, a register clocked by put_clk (get_clk) is required. As shown in Fig. 7, the synchronizer can consist of any number of half-cycle and full-cycle synchronizing stages.
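A behavioral sketch of the multiplexer-based bypass may help to visualize the put control block. The Python model below is not cycle-accurate and rests on several assumptions that are not stated in the text: the synchronizer is modeled as a shift register of full-cycle stages that resets to "full", and the AND gate is assumed to combine the put token with the inverted, selected full signal. The class and parameter names are hypothetical.

```python
from collections import deque


class PutControlBlock:
    """Behavioral sketch of a put control block with a bypassable synchronizer."""

    def __init__(self, sync_stages: int = 2):
        # Synchronizer modeled as a shift register. Assumption: it resets to
        # "full" (all ones) so no write is enabled before a real sample arrives.
        self.sync = deque([1] * sync_stages, maxlen=sync_stages)

    def clock(self, full: int, put_token: int, bi_syn_mode: bool) -> int:
        """Advance one put_clk cycle and return write_enable."""
        synced_full = self.sync[0]   # oldest sample leaves the synchronizer
        self.sync.append(full)       # newest sample enters it
        # Multiplexer: synchronized full in bi-synchronous mode, raw full otherwise.
        selected = synced_full if bi_syn_mode else full
        # AND gate (assumed inputs): the cell holds the token and is not full.
        return int(bool(put_token) and not selected)


bypassed = PutControlBlock(sync_stages=2)
print(bypassed.clock(full=0, put_token=1, bi_syn_mode=False))  # 1

synced = PutControlBlock(sync_stages=2)
print([synced.clock(full=0, put_token=1, bi_syn_mode=True) for _ in range(3)])  # [0, 0, 1]
```

In this model the bypassed path reacts to a falling full signal in the same cycle, whereas the synchronized path needs as many extra cycles as there are synchronizer stages.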

The bypassing technique used in the design style #2 is very similar to its #1 counterpart; however, it has some advantages and drawbacks compared to the prior style. Like the design style #1, this style has negligible area overhead in comparison with its baseline FIFO; however, in the design style #1, in order to sample the pointers properly and consequently prevent metastability, we use two tokens for the read pointer and two tokens for the write pointer. This makes the empty detector a bit more complicated than in the design style #2, as it requires comparing two synchronized write pointer bits with two read pointer bits for each stage. On top of that, in long-depth FIFOs, the large-fan-in OR and AND logic gates in the full and empty detectors have a negative influence on both footprint and latency. Therefore, as can be seen in Section 5, the design style #2 has a smaller footprint than the previous style for large FIFOs. In addition, a p-stage FIFO using design style #2 can hold p items, which is more efficient than the style #1. On the other hand, the per-stage synchronizers in both put and get control blocks make the style #2 more complicated to exploit in modest FIFO designs.

Similar to the design style #1, due to the use of a token ring for the full and empty detectors, this proposed style is suitable only for DVFA techniques, especially when high-speed and large FIFOs are required (8 × 32 or larger). Using this FIFO instead of [14] in DVFA techniques results in considerable power savings and latency improvement in VFI-based systems because there are two synchronizer units in each buffer stage (slot) which can be bypassed and turned off during the synchronous mode.

Fig. 7. Cycle synchronizer with asynchronous set [14].

Fig. 8. Reconfigurable Syn/Bi-Syn FIFO style #3 architecture. (Blocks in the figure: FIFO Memory (Dual Port RAM), FIFO wptr & full, FIFO rptr & empty, sync_r2w, sync_w2r, unsynchronized full/empty detector, two multiplexers controlled by Syn/Bi-Syn Mode.)

4.3. RSBS FIFO design style #3

So far, two alternative design styles with their pros and cons have been described. In this part, we present the third design style for implementing RSBS FIFOs based on [11]. In the third proposed design style, after bypassing and switching off the unneeded components in the synchronous mode, in order to have a simple synchronous FIFO (without anything else such as a token ring, code converter, etc.) we have added an extra component to detect fullness and emptiness. Fig. 8 shows the design style #3 architecture consisting of six main modules. The FIFO Memory block is a buffer accessed by both the write and read clock domains. This buffer is most likely an instantiated, synchronous dual-port RAM, but other memory styles can be adapted to function as the FIFO buffer. The sync_r2w (sync_w2r) module is a synchronizer used to synchronize the read (write) pointer into the write (read) clock domain in the bi-synchronous mode. The FIFO rptr & empty block is completely synchronous to the read-clock domain and contains the FIFO read pointer and empty-flag logic. Similarly, the FIFO wptr & full block is completely synchronous to the write-clock domain and contains the FIFO write pointer and full-flag logic. The embedded unsynchronized full-empty detector module, in turn, is responsible for generating the full and empty flags in the synchronous mode.

The architecture of the FIFO rptr & empty (FIFO wptr & full) block is shown in Fig. 9. This module consists of a counter and a register (Binary register) used to address the FIFO memory directly without the need to translate memory addresses into Gray codes. In addition, it exploits a binary-to-Gray converter and another register (Gray register) which generates the n-bit Gray-code pointer to be synchronized into the opposite clock domain, and also one Empty (Full) Detector block to check emptiness (fullness) of the FIFO in the bi-synchronous mode.

Fig. 9. FIFO wptr & full block diagram.
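The binary-to-Gray conversion used for the pointers of the design style #3 can be sketched as follows. This is a generic illustration of the standard conversion (XOR of the value with itself shifted right by one), not the authors' RTL; the helper names are hypothetical.

```python
from typing import List


def bin_to_gray(value: int) -> int:
    """Standard binary-to-Gray conversion."""
    return value ^ (value >> 1)


def gray_sequence(n_bits: int) -> List[int]:
    """Gray codes of all n-bit binary pointer values, in counting order."""
    return [bin_to_gray(i) for i in range(1 << n_bits)]


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


codes = gray_sequence(3)
print([format(c, "03b") for c in codes])
# Consecutive pointer values, including the wrap-around, differ in one bit only,
# which is what makes the pointer safe to sample in the opposite clock domain.
print(all(hamming(codes[i], codes[(i + 1) % len(codes)]) == 1 for i in range(len(codes))))
```

Because the memory is addressed by the binary register, the Gray value is only needed for the copy that crosses the clock-domain boundary.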

Fig. 10. Reconfigurable Syn/Bi-Syn FIFO style #4 architecture. (Blocks in the figure: FIFO Memory (Dual Port RAM), FIFO wptr & full, FIFO rptr & empty, Johnson-code wptr and rptr paths, two multiplexers controlled by Syn/Bi-Syn Mode.)


In our third design style, once the FIFO receives the command to operate in the synchronous mode via the Syn/Bi-Syn Mode signal, the blocks in Figs. 8 and 9 highlighted with gray circles are removed from the FIFO path and switched off. Since in the synchronous mode it is not necessary to synchronize the pointers into opposite clock domains, we add the unsynchronized full-empty detector block to produce the empty and full flags using only the binary pointers. In contrast, as the mode changes to bi-synchronous, the blocks in Fig. 8 shown in black color are bypassed and turned off, and the fullness and emptiness are again checked using the Gray pointers and synchronizers.

FIFOs based on the design style #3 can be filled up to the last position, because the empty and full detectors utilize an extra bit in each pointer. When the write pointer is incremented past the final FIFO address, the write pointer will assert the unused MSB while setting the rest of the bits back to zero, and the same is done with the read pointer. As can be seen from Fig. 9, in contrast with the previous styles, the design style #3 uses pointers with $\lceil \log_2(\text{FIFO depth}) \rceil + 1$ bits, and therefore it can be argued that this design style is practically a simple synchronous FIFO when the unused components are turned off in the synchronous mode. In addition, although some small components are added in this style, it still has a smaller footprint than the other mentioned styles (see Section 5).
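The extra-MSB scheme can be sketched for the synchronous mode, where full and empty are derived directly from the binary pointers. The Python class below is a minimal model under the assumption of a power-of-two depth; the class and method names are hypothetical and it tracks pointers only, not the stored data.

```python
class SyncPointers:
    """Binary read/write pointers with one extra MSB, as used for full/empty detection."""

    def __init__(self, depth: int):
        if depth < 2 or depth & (depth - 1):
            raise ValueError("depth must be a power of two")
        self.depth = depth
        self.mask = 2 * depth - 1  # log2(depth) address bits plus one wrap bit
        self.wptr = 0
        self.rptr = 0

    def empty(self) -> bool:
        # Empty: pointers are identical, including the wrap bit.
        return self.wptr == self.rptr

    def full(self) -> bool:
        # Full: same address bits, but the wrap bits differ.
        return (self.wptr ^ self.rptr) == self.depth

    def waddr(self) -> int:
        """Memory write address: the pointer without its MSB."""
        return self.wptr & (self.depth - 1)

    def write(self) -> bool:
        if self.full():
            return False
        self.wptr = (self.wptr + 1) & self.mask
        return True

    def read(self) -> bool:
        if self.empty():
            return False
        self.rptr = (self.rptr + 1) & self.mask
        return True


fifo = SyncPointers(4)
print([fifo.write() for _ in range(5)])  # [True, True, True, True, False]
print(fifo.full(), fifo.wptr, fifo.waddr())  # True 4 0
```

After four writes the write pointer has its MSB set and its address bits back at zero, so all four slots are used before the full flag is raised.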

This style supports only FIFO capacities that are powers of two, but if this condition can be tolerated by the system, it could be a suitable option for VFI-based NoC systems using either DVFA or SVFA techniques. It should be emphasized that in NoC systems which benefit from SVFA techniques, the positions of the islands are constant. These NoC systems are not appropriate for mapping various applications at different times. Therefore, it is desirable to have synchronous FIFOs for intra-island communication which do not have extra latency and power consumption overhead. As a result, if the area overhead of the inactive components is acceptable for the system, exploiting a FIFO using the design style #3 is quite effective.

Table 1. 4-bit Johnson encoding.

Count   Q0   Q1   Q2   Q3
0       0    0    0    0
1       1    0    0    0
2       1    1    0    0
3       1    1    1    0
4       1    1    1    1
5       0    1    1    1
6       0    0    1    1
7       0    0    0    1

Fig. 11. 4-bit Johnson encoding implementation.

4.4. RSBS FIFO design style #4

The Gray-encoded reconfigurable FIFO of the design style #3 offers many advantages in terms of power consumption and performance. However, it still suffers from latency and power overheads due to the existence of pointer counters and the complexity of the fullness/emptiness checking logic. Moreover, as mentioned before, the Gray-encoded design style can only support FIFO capacities that are powers of two. The FIFO presented in this section is capable of overcoming the aforementioned issues. In contrast with all the previously presented design styles, in this section we illustrate novel emptiness and fullness detectors (FIFO wptr & full and rptr & empty blocks) which are not based on the existing bi-synchronous FIFOs. The design style #4 is based on Johnson encoding and has substantial advantages over the aforementioned design styles. Firstly, it provides more efficient reconfigurability to bi-synchronous FIFOs to prevent their associated power and latency overheads in cases where their synchronizers are not needed. Secondly, in addition to the register-based implementation of Johnson-based FIFOs using one-hot addressing, it supports a standard memory-based implementation addressed by normal binary code.

As shown in Fig. 10, similar to most bi-synchronous FIFOs, five typical modules compose the Johnson-based RSBS FIFO architecture: FIFO Memory block, sync_r2w, sync_w2r, FIFO rptr & empty, and FIFO wptr & full. In the proposed design style, to provide reconfigurability, we have exploited two multiplexers and two flip-flops to bypass unused components in the synchronous mode. For this purpose, we added the Syn/Bi-Syn_Mode signal indicating the operation mode of the FIFO (synchronous or bi-synchronous). Before describing the main function of the RSBS FIFO, let us first focus on the properties of Johnson encoding [12] for the FIFO read and write pointers and the internal structure of the FIFO wptr & full (FIFO rptr & empty) block.

As discussed, Gray code poses some limitations in terms of implementation complexity. The first is that Gray code allows encoding only power-of-two ranges, while the FIFO size may be optimal at a value that is not a power of two. The second limitation is that, in contrast to binary encoding with the full-adder standard cell, there is no elementary logical operator to perform an addition in Gray encoding. Hence, the increment of the pointers needs to be hardwired at the cost of more area and lower performance. In order to cope with these issues, we use Johnson encoding for the read and write pointers in this design style.

Johnson encoding is also a code with a Hamming distance of 1 between consecutive elements, which allows safe synchronization of the pointers. To implement the sequence, bits are chained in series as in a shift register, and the loop is closed using an inverter, so that the least significant bit is implemented as the negation of the most significant bit. Table 1 and Fig. 11 show an example of 4-bit Johnson encoding and its implementation with initial register values of 0000, respectively. As shown in Fig. 11, an N-stage Johnson encoder circulates a single data bit, giving a sequence of 2N different states, and can therefore encode 2N values with N bits. To differentiate between FIFO fullness and emptiness, a parity bit is added to the binary pointers, virtually doubling the addressing range of the pointers [32]. This parity method is extensible to both Gray and Johnson encodings, but for Johnson encoding it is simpler because of the twisted-ring sequence. When Johnson encoding is used, the buffer is empty if write_pointer = read_pointer, and it is full if write_pointer = NOT read_pointer.

Fig. 12. FIFO wptr & full block diagram.

Fig. 13. Mesochronous adaptation for the design style #1.

Fig. 14. Mesochronous adaptation for the design style #3.
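The Johnson pointer behavior described above can be sketched in a few lines. The Python model below follows Table 1 (bit Q0 receives the inverted value of the last bit, every other bit takes its predecessor's value) and the stated empty/full conditions; the function names are hypothetical, and the FIFO depth is assumed to equal the number of pointer bits N.

```python
from typing import List, Tuple

State = Tuple[int, ...]


def johnson_next(state: State) -> State:
    """One step of a twisted-ring (Johnson) counter: Q0 <= NOT Q[N-1], Qi <= Q[i-1]."""
    return (1 - state[-1],) + state[:-1]


def johnson_states(n_bits: int) -> List[State]:
    """All 2N states of an N-bit Johnson counter, starting from all zeros."""
    state: State = (0,) * n_bits
    states = []
    for _ in range(2 * n_bits):
        states.append(state)
        state = johnson_next(state)
    return states


def is_empty(wptr: State, rptr: State) -> bool:
    return wptr == rptr


def is_full(wptr: State, rptr: State) -> bool:
    # Full when the write pointer is the bitwise complement of the read pointer.
    return wptr == tuple(1 - b for b in rptr)


for count, st in enumerate(johnson_states(4)):
    print(count, st)

w = r = (0, 0, 0, 0)
for _ in range(4):          # four writes into an initially empty FIFO
    w = johnson_next(w)
print(is_full(w, r), is_empty(w, r))  # True False
```

The complement of a Johnson state lies exactly N steps further along the 2N-state cycle, which is why the full condition reduces to a bitwise inversion and no pointer arithmetic is needed.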

The architecture of the FIFO wptr & full (FIFO rptr & empty) block is shown in Fig. 12. This module consists of a Johnson-encoded register to generate the n-bit pointer to be synchronized into the opposite clock domain. In addition, it exploits a Johnson-to-binary converter and another register (Binary register) used to address the FIFO memory directly without the need to translate memory addresses, and also one Full (Empty) Detector block to check fullness (emptiness) of the FIFO.
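One possible Johnson-to-binary mapping is sketched below. The decoding rule is an assumption made for illustration and is not taken from the paper: it uses the state order of Table 1, counts the ones in the first half of the cycle and the zeros in the second half, and takes the memory address as the count modulo the number of pointer bits. The function names are hypothetical.

```python
from typing import Tuple


def johnson_to_count(state: Tuple[int, ...]) -> int:
    """Position of a Johnson state in its 2N-step cycle (Table 1 ordering)."""
    n = len(state)
    ones = sum(state)
    # First half of the cycle (last bit still 0): ones are shifted in.
    # Second half (last bit 1): zeros are shifted in.
    return ones if state[-1] == 0 else n + (n - ones)


def johnson_to_address(state: Tuple[int, ...]) -> int:
    """Binary memory address, assuming a FIFO depth equal to the number of pointer bits."""
    return johnson_to_count(state) % len(state)


table1 = [
    (0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 0),
    (1, 1, 1, 1), (0, 1, 1, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]
print([johnson_to_count(s) for s in table1])    # [0, 1, 2, 3, 4, 5, 6, 7]
print([johnson_to_address(s) for s in table1])  # [0, 1, 2, 3, 0, 1, 2, 3]
```

The second half of the cycle revisits the same addresses as the first, so the half in which a pointer currently sits plays the role of the wrap (parity) information.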

As a conclusion of this section, we now list the contributions of our proposed RSBS FIFOs and highlight their advantages over other available bi-synchronous FIFOs. Firstly, we have presented a generic technique which is capable of providing reconfigurability for existing bi-synchronous FIFOs, which leads to considerable latency improvement and power savings for VFI-based systems. We applied this technique to three popular bi-synchronous FIFOs as presented in the above Design Style #1 to #3 subsections. Each of the three enhanced reconfigurable FIFOs is suitable for different types of applications. Secondly, we have proposed our own reconfigurable FIFO which is based on Johnson encoding. Since the proposed Johnson-encoded FIFO has been redesigned from the ground up to provide reconfigurability, the shortcomings of the previous RSBS FIFOs have been addressed in this design style. Our methodology enables designers to add reconfigurability to their existing designs or choose our proposed Johnson-encoded RSBS FIFO. Finally, in the following, it is shown that the proposed RSBS FIFOs can be easily adapted to be used in the mesochronous mode.

4.5. Mesochronous adaptation

In some cases, it is not possible to exploit a central clock generator for NoC-based systems (specifically for DVFA-based ones). In these situations, each node has its own clock generator (phase-locked loop) and the FIFO architecture should be adapted to interface mesochronous clock domains where the sender and the receiver have the same clock frequency but different phases. To this end, the RMBS (Reconfigurable Mesochronous/Bi-Synchronous) FIFO should be utilized instead of the RSBS one. The phase difference can be constant or slowly varying. According to [33], metastability can be avoided when the rising edges of the clock signals are predictable, and the two rows of registers in the synchronizer can be reduced to a single row of registers.

As an example, Figs. 13 and 14 show the parts of the proposed design styles #1 and #3, respectively, which we have modified for correct emptiness detection in the mesochronous mode. In this case, the first row of registers is clocked using a delayed version of the read clock in the mesochronous mode. This delay must be chosen so that data is exchanged without metastable situations. The delay can be a programmable delay, or any other metastability-free solution, as for example in the Chakraborty-Greenstreet [34] architecture, which allows the FIFO to work also with plesiochronous (small difference of frequency) clocks. Likewise, if the write and read clocks are out of phase by 180° (clock inverter), no programmable delay is needed because, by construction, the communication is free of metastability. If the phase difference varies between 90° and 270°, the interface is also free of metastability. This mesochronous adaptation of the bi-synchronous FIFO is simple and allows switching between the mesochronous and bi-synchronous modes.
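The phase window mentioned above can be written as a simple check. This Python sketch only encodes the 90° to 270° range stated in the text; the function name is hypothetical, and treating every phase outside that range as requiring the delayed read clock is an assumption made for illustration.

```python
def needs_programmable_delay(phase_deg: float) -> bool:
    """True if the write/read phase difference falls outside the 90..270 degree window."""
    phase = phase_deg % 360.0  # normalize to one clock period
    return not (90.0 <= phase <= 270.0)


for phase in (0, 45, 90, 180, 270, 300):
    print(phase, needs_programmable_delay(phase))
```

An inverted clock (180°) sits in the middle of the window, which matches the remark that no programmable delay is needed in that case.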

5. Analysis and case study

To assess the efficiency of the proposed RSBS FIFOs, a synthesizable model for each style and its baseline FIFO has been developed. We have simulated the reconfigurable FIFOs based on each design style to characterize their latency, throughput, area, and power consumption. Note that, to observe the power efficiency of the FIFOs, we have employed them in a NoC-based MPEG system as a case study and compared it to a similar system using conventional bi-synchronous FIFOs.

5.1. Latency analysis

As the sender and the receiver have different clock signals, the latency of the FIFO depends on the relation between these two signals. The latency can be decomposed into two parts: the state machine latency and the synchronization latency. As the state machine is designed using a Moore automaton, its latency is one clock cycle. In the bi-synchronous mode, s rows of registers compose the synchronizers and its latency is ΔT plus one clock cycle, where ΔT is the difference, in time, between the rising edges of the sender and receiver clocks. As this difference is between zero and one Clk_read clock cycle, the latency of the RSBS FIFO for all three styles is between s and s + 1 Clk_read clock cycles in the bi-synchronous mode. Obviously, in the synchronous mode, there is no difference between Clk_read and Clk_write and there is no synchronizer; hence data can be fetched by the receiver on the next rising/falling edge of Clk_read.

Fig. 15. Latency analysis of conventional bi-synchronous FIFOs. (Waveform signals: clk_put, req_put, datain_bus, space_avail, put_token_bus on the put interface; req_get, dataout_bus, data_valid, get_token_bus on the get interface.)

Fig. 16. Latency analysis of the RSBS FIFO. (Same signals as in Fig. 15.)

As an example, Fig. 15 shows a simulation plot of a three-stage FIFO with a 2-cycle synchronizer. The put and get clocks have the same frequency and zero phase offset. Once the first put request is issued, data is written to the first FIFO stage on the next rising clock edge. Three clock cycles later data_valid rises, allowing the receiver to consume the data. By this time, three more FIFO writes have taken place and the space_avail signal drops to notify the sender that the FIFO is full. Another three clock cycles later, the space_avail signal rises and the sender can write to the first FIFO stage again. The put request is not serviced until this time. In contrast, Fig. 16 shows a simulation plot of a three-stage RSBS FIFO used in the synchronous mode. One clock cycle after the data is written to the first FIFO stage, data_valid rises, allowing the receiver to consume the data.

When an RMBS FIFO is used, the latency of the bi-synchronous FIFO is reduced, because a single register can replace the two-register synchronizer. In addition, ΔT is constant as the phase difference is constant. In that case, the latency of the FIFO is one clock cycle plus ΔT.
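The latency figures of this subsection can be summarized as per-mode bounds. The Python sketch below is a hedged restatement of the text, with hypothetical names: it assumes that ΔT lies between zero and one Clk_read cycle in every mode, and it returns bounds rather than a single value because ΔT is not known in advance.

```python
from typing import Tuple


def latency_bounds(mode: str, sync_rows: int = 2) -> Tuple[int, int]:
    """(min, max) FIFO latency in Clk_read cycles for a given operation mode."""
    if mode == "syn":
        # No synchronizer: data is available on the next clock edge.
        return (1, 1)
    if mode == "meso":
        # One clock cycle plus delta_T, with 0 <= delta_T <= 1 cycle.
        return (1, 2)
    if mode == "bi-syn":
        # Between s and s + 1 cycles for s rows of synchronizer registers.
        if sync_rows < 1:
            raise ValueError("sync_rows must be at least 1")
        return (sync_rows, sync_rows + 1)
    raise ValueError(f"unknown mode: {mode}")


for mode in ("syn", "meso", "bi-syn"):
    print(mode, latency_bounds(mode, sync_rows=2))
```

With a two-row synchronizer, the three-cycle delay observed in Fig. 15 falls inside the bi-synchronous bounds, and the one-cycle delay of Fig. 16 matches the synchronous case.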

5.2. Throughput analysis

The throughput of the FIFOs for each design style and each operation mode has been analyzed as a function of the FIFO depth. For all the design styles in the bi-synchronous and mesochronous modes, as the synchronizers add latency, the performance of the flow control of the FIFO is penalized. In the case of a deep FIFO, those latencies do not decrease the FIFO throughput since the buffered data compensate for the latency of the flow control. When the FIFO, based on any of the design styles, operates in the synchronous mode, the minimum FIFO depth required to provide maximum throughput decreases because there is no need for synchronizers. Table 2 shows the minimum FIFO depth for 50% and 100% throughput as a function of the clocking mode. Note that for the bi-synchronous mode analysis, the write and read clock frequencies are equal; otherwise it is not possible to obtain 100% throughput. Meanwhile, there is an empty slot in the table because the minimum possible FIFO depth for the design style #1 is four.

Table 2. Minimum FIFO depth as a function of the clock relation and required throughput.

Style                        Mode            Minimum depth for 50% throughput   Minimum depth for 100% throughput
Design style #1              Bi-syn. mode    5                                  6
Design style #1              Meso. mode      4                                  5
Design style #1              Syn. mode       –                                  4
Design style #2, #3 & #4     Bi-syn. mode    3                                  6
Design style #2, #3 & #4     Meso. mode      2                                  4
Design style #2, #3 & #4     Syn. mode       1                                  2

5.3. Area calculation

The area of the FIFOs was computed after synthesis on CMOS 90 nm GPLVT STMicroelectronics standard cells using Synopsys Design Compiler. Different FIFO depths are used to illustrate the scalability of the architectures. Table 3 shows the area of the 16- and 32-bit RSBS FIFOs for each design style and the baseline bi-synchronous FIFOs as a function of the FIFO depth. Note that for the design style #4, there is no comparison to any baseline FIFO owing to its novelty. In addition, to clarify the impact of synchronizers on each design style, Table 4 formulates the total number of synchronization components used in each design style; let p be the buffer depth, z be the number of full synchronizer (register) rows, and f be the number of half synchronizer (transparent latch) rows.

Table 3. Area and overhead comparison between the RSBS and RMBS design styles and their baseline designs.

Style                        4 × 16 (µm²)     4 × 32 (µm²)     8 × 16 (µm²)     8 × 32 (µm²)
Panades et al. [13]          3600             6511             7117             12,939
Design style #1 RSBS FIFO    3635 (0.98%)     6546 (0.54%)     7187 (0.99%)     13,009 (0.54%)
Design style #1 RMBS FIFO    3664 (1.79%)     6570 (0.91%)     7264 (2.06%)     13,089 (1.16%)
Ono et al. [14]              4266             6866             7004             10,906
Design style #2 RSBS FIFO    4332 (1.54%)     6933 (0.97%)     7105 (1.43%)     11,007 (0.93%)
Design style #2 RMBS FIFO    4382 (2.73%)     6992 (1.84%)     7319 (3.01%)     11,123 (1.99%)
Cummings [11]                3326             5779             6237             11,298
Design style #3 RSBS FIFO    3434 (3.25%)     5888 (1.88%)     6354 (1.88%)     11,415 (1.04%)
Design style #3 RMBS FIFO    3530 (6.13%)     5983 (3.53%)     6509 (4.36%)     11,570 (2.41%)
Design style #4 RSBS FIFO    3331             5795             6267             11,332
Design style #4 RMBS FIFO    3420             5877             6416             11,463

Table 4. Total number of synchronization components.

Design style                 Number of synchronization components
Design style #1              $(p + 1) \times (z + f)$
Design style #2              $2 \times p \times (z + f)$
Design style #3 & #4         $2 \times \lceil \log_2 p + 1 \rceil \times (z + f)$

For the design styles #1 and #2, even though there is no need for additional components to detect emptiness and fullness, there is still a slight area overhead compared with the baseline FIFOs. In the case of the design style #3 there is a negligible area overhead which decreases when the FIFO depth increases. Compared to the design style #3, the Johnson-based RSBS FIFO (design style #4) has lower area overhead due to removed pointer counters and simplified fullness/emptiness detection logic. Note that for the design style #1, a p-stage FIFO can only hold p − 1 items.

5.4. Power consumption and latency evaluation of a VFI-based NoC running MPEG-4 decoding system

We apply the proposed RSBS-FIFO-based switch to the NoC-based MPEG-4 decoder shown in Fig. 17 [35] and compare it with a similar system which does not benefit from the reconfigurable FIFOs. The MPEG-4 decoder system is modeled and mapped on a 5 × 3 NoC. In the system of Fig. 17, each node has a 5 × 5 crossbar switch based on the RASoC [27] NoC switch; one port of the crossbar switch is connected to the functional block and the other four ports are used to interconnect neighboring modules. In the figure, the numbers above the links connecting nodes represent the average bandwidth between them. Since MPEG videos show a lot of variability in processing time depending on the type of frame being processed, we perform prediction-based dynamic voltage/frequency assignment on each node based on the DVFA algorithm proposed in [8]. The prediction decision is taken at the start of processing of a new macroblock at each node, and for each input channel we add a synchronous/bi-synchronous mode selector unit based on the architecture explained in Section 3.1. The simulation is performed for three different frequency sets having 2, 4, and 6 frequency levels. We assume that the switches are clocked by a central clock generator block; therefore they do not need the mesochronous adaptation. In addition, for all the RSBS FIFOs, we consider two-row synchronizers. In this work, clock gating is used as the power saving method, although more power could be saved by exploiting transistor-level techniques such as power gating (note that the loss of states in registers in sleep mode should be considered).

Fig. 17. Mapping of MPEG-4 decoding system on a mesh NoC: Average bandwidths between two nodes are presented (MB/s).

Fig. 18 shows the average power saving percentages of the NoC switches achieved by exploiting the RSBS FIFOs instead of the conventional bi-synchronous FIFOs. The comparison is made between each design style and its baseline counterpart for three different frequency sets. We compare the design style #4 with the conventional bi-synchronous FIFO presented in [11] because of their similarity in the main structure. As the results show, we get around 3.5–17% savings over the baseline architectures. As expected, when there are fewer frequency levels, the possibility of operating in the synchronous mode increases and more power can be saved. Note that in this simulation, all the FIFOs have eight 16-bit slots.

Fig. 18. Average power savings for the NoC switches used in MPEG Encoder. (Horizontal axis: FIFO design style, Design Style #1 to #4; vertical axis: power savings (%); series: VFI_dyn_RSBS with 2, 4 and 6 levels.)

The simulation has also been performed to determine the percentage of average packet latency improvement for two different data widths. The packet latency is defined as the number of cycles from the creation of the first flit at a source node to the moment when the last flit is delivered to a destination node. Since each packet, during the traversal from a source to a destination, may pass through several VFIs having different frequencies, to perform a correct latency assessment we measure the latency of a packet for each region separately, and subsequently merge the measured latency values. It was assumed that packets had a fixed length of five flits, and the data width was set to 16 and 32 bits for each set of results. As the latency is measured in number of cycles and all the design styles have the same number of synchronizer rows, the simulation has been performed only for the NoC using the design style #3 and its baseline for three different frequency levels. It can be concluded from the simulation results shown in Fig. 19 that around 8.5–29% latency improvement over the baseline architecture can be gained, especially for the VFI-based system with a small number of frequency levels.

Fig. 19. Average latency improvement for the NoC-based system running MPEG Encoder. (Horizontal axis: FIFO size, number of slots × data width, 8×16 and 8×32; vertical axis: latency improvement (%); series: VFI_dyn_RSBS with 2, 4 and 6 levels.)

6. Conclusion

In this paper, a reconfigurable synchronous/bi-synchronous FIFO that can be used to improve the power and performance characteristics of VFI-based NoCs was presented. The FIFO, which is able to work in the synchronous and bi-synchronous modes, addresses the squandered synchronization power and latency overhead for the case where adjacent switches in the NoC system operate at the same clock frequency but suffer from unnecessary synchronization. The FIFOs are scalable and synthesizable with synchronous standard cells. A technique for mesochronous adaptation of the FIFOs was suggested. Moreover, we presented different techniques to describe how the FIFO could be utilized in the VFI-based NoC. For this purpose, four different design styles were developed, synthesized with a 90 nm process library, and thoroughly analyzed in terms of power consumption, latency, throughput, and area. Our results revealed that, compared to a non-reconfigurable system architecture, a NoC system using the proposed FIFOs and the operation mode controller is able to run the MPEG-4 encoder application with considerably higher performance and lower power consumption at the cost of negligible area overhead.

Acknowledgment

The authors wish to acknowledge the financial support by the Academy of Finland, Ulla Tuominen Foundation, and Nokia Foundation during the course of this project.

References

[1] A. Jantsch, H. Tenhunen (Eds.), Networks on Chip, Kluwer AcademicPublishers., 2003.

[2] L. Benini, G. De Micheli, Networks on chips: a new SoC paradigm, Computer 35(1) (2002) 70–78.

[3] D.M. Chapiro, Globally-Asynchronous Locally-Synchronous Systems, PhDthesis, Stanford University, October 1984.

[4] J. Muttersbach, T. Villiger, W. Fichtner, Practical design of globally-asynchronous locally-synchronous systems, in: Proceedings of the 6thInternational Symposium on Advanced Research in Asynchronous Circuitsand Systems, 200, pp. 52–59.

[5] D.E. Lackey, P.S. Zuchowski, T.R. Bednar, D.W. Stout, S.W. Gould, J.M. Cohn, Managing power and performance for system-on-chip designs using voltage islands, in: Proceedings of the 2002 IEEE/ACM International Conference on Computer-Aided Design, 2002, pp. 195–202.

[6] P. Choudhary, D. Marculescu, Power management of voltage/frequency island-based systems using hardware-based methods, IEEE Transactions on Very Large Scale Integration Systems 17 (2009) 427–438.

[7] C.-L. Chou, U.Y. Ogras, R. Marculescu, Energy- and performance-aware incremental mapping for networks on chip with multiple voltage levels, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 27 (10) (2008) 1866–1879.

[8] K. Niyogi, D. Marculescu, Speed and voltage selection for GALS systems based on voltage/frequency islands, in: Proceedings of the 2005 Asia and South Pacific Design Automation Conference, 2005, pp. 292–297.

[9] T. Chelcea, S.M. Nowick, Robust interfaces for mixed-timing systems, IEEE Transactions on Very Large Scale Integration Systems 12 (2004) 857–873.

[10] S. Dasgupta, A. Yakovlev, Comparative analysis of GALS clocking schemes, IET Computers & Digital Techniques 1 (2) (2007) 59–69.

[11] C. Cummings, P. Alfke, Simulation and synthesis techniques for asynchronous FIFO design with asynchronous pointer comparison, in: Proceedings of the SNUG-2002, 2002.

[12] Y. Thonnart, E. Beigné, P. Vivet, Design and implementation of a GALS adapter for ANoC based architectures, in: Proceedings of the 15th IEEE Symposium on Asynchronous Circuits and Systems, 2009, pp. 13–22.

[13] I. Miro Panades, A. Greiner, Bi-synchronous FIFO for synchronous circuit communication well suited for network-on-chip in GALS architectures, in: Proceedings of the First International Symposium on Networks-on-Chip, 2007, pp. 83–94.

[14] T. Ono, M. Greenstreet, A modular synchronizing FIFO for NoCs, in: Proceedings of the 3rd ACM/IEEE International Symposium on Networks-on-Chip, 2009, pp. 224–233.

[15] D.E. Lackey, P.S. Zuchowski, T.R. Bednar, D.W. Stout, S.W. Gould, J.M. Cohn, Managing power and performance for system-on-chip designs using voltage islands, in: Proceedings of the 2002 IEEE/ACM International Conference on Computer-Aided Design, 2002, pp. 195–202.

[16] J. Hu, Y. Shin, N. Dhanwada, R. Marculescu, Architecting voltage islands in core-based system-on-a-chip designs, in: Proceedings of the 2004 International Symposium on Low Power Electronics and Design, 2004, pp. 180–185.

[17] J. Quartana, S. Renane, A. Baixas, L. Fesquet, M. Renaudin, GALS systems prototyping using multiclock FPGAs and asynchronous network-on-chips, in: Proceedings of the International Conference on Field Programmable Logic and Applications, 2005, pp. 299–304.

[18] G. Campobello, M. Castano, C. Ciofi, D. Mangano, GALS networks on chip: a new solution for asynchronous delay-insensitive links, in: Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 2006, pp. 1–6.

[19] U.Y. Ogras, R. Marculescu, P. Choudhary, D. Marculescu, Voltage-frequency island partitioning for GALS-based networks-on-chip, in: Proceedings of the IEEE Design Automation Conference, 2007, pp. 110–115.

[20] W. Jang, D. Ding, D.Z. Pan, A voltage-frequency island aware energy optimization framework for networks-on-chip, in: Proceedings of the International Conference on Computer-Aided Design, 2008, pp. 264–269.

[21] P. Ghosh, A. Sen, A. Hall, Energy efficient application mapping to NoC processing elements operating at multiple voltage levels, in: Proceedings of the ACM/IEEE International Symposium on Networks-on-Chip, 2009, pp. 80–85.

[22] M. Arjomand, H. Sarbazi-Azad, Voltage-frequency planning for thermal-aware, low-power design of regular 3-D NoCs, in: Proceedings of the International Conference on VLSI Design, 2010, pp. 57–62.

[23] U.Y. Ogras, R. Marculescu, D. Marculescu, E.G. Jung, Design and management of voltage-frequency island partitioned networks-on-chip, IEEE Transactions on Very Large Scale Integration Systems 17 (2009) 330–341.

[24] P. Bogdan, R. Marculescu, S. Jain, R.T. Gavila, An optimal control approach to power management for multi-voltage and frequency islands multiprocessor platforms under highly variable workloads, in: Proceedings of the Sixth IEEE/ACM International Symposium on Networks on Chip, 2012, pp. 35–42.

[25] A.-M. Rahmani, P. Liljeberg, J. Plosila, H. Tenhunen, An efficient VFI-based NoC architecture using Johnson-encoded reconfigurable FIFOs, in: Proceedings of the IEEE International Norchip Conference, 2010, pp. 1–5.

[26] L.S. Nielsen, C. Niessen, Low-power operation using self-timed circuits and adaptive scaling of the supply voltage, IEEE Transactions on Very Large Scale Integration Systems 2 (1994) 391–397.

[27] C.A. Zeferino, M.E. Kreutz, A.A. Susin, RASoC: a router soft-core for networks-on-chip, in: Proceedings of the Conference on Design, Automation and Test in Europe, vol. 3, 2004, pp. 198–203.

[28] A. Sheibanyrad, I. Miro Panades, A. Greiner, Systematic comparison between the asynchronous and the multi-synchronous implementations of a network on chip architecture, in: Proceedings of the Conference on Design, Automation and Test in Europe, 2007, pp. 1090–1095.

[29] R. Ginosar, Fourteen ways to fool your synchronizer, in: Proceedings of the 9th International Symposium on Asynchronous Circuits and Systems, 2003, pp. 89–97.

[30] B.V.N. Silpa, A. Shrivastava, K. Gummidipudi, P.R. Panda, Power-Efficient System Design, Springer, 2010.

[31] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, J. Yamada, 1-V power supply high-speed digital circuit technology with multithreshold-voltage CMOS, IEEE Journal of Solid-State Circuits 30 (8) (1995) 847–854.

[32] R.W. Apperson, Z. Yu, M.J. Meeuwsen, T. Mohsenin, B.M. Baas, A scalable dual-clock FIFO for data transfers between arbitrary and haltable clock domains, IEEE Transactions on Very Large Scale Integration Systems 15 (2007) 1125–1134.

[33] F. Mu, C. Svensson, Self-tested self-synchronization circuit for mesochronous clocking, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 48 (2) (2001) 129–140.

[34] A. Chakraborty, M.R. Greenstreet, Efficient self-timed interfaces for crossing clock domains, in: Proceedings of the 9th International Symposium on Asynchronous Circuits and Systems, 2003, pp. 78–88.

[35] E.B. van der Tol, E.G.T. Jaspers, Mapping of MPEG-4 decoding on a flexible architecture platform, in: Media Processors 2002, 2002, pp. 1–13.

A.-M. Rahmani et al. / Microprocessors and Microsystems 37 (2013) 432–445

Amir-Mohammad Rahmani received his Master’s degree in computer architecture from the Department of Electrical and Computer Engineering, University of Tehran, in 2009. He is currently pursuing his research in the Embedded Computer and Electronic Systems Laboratory, University of Turku, Finland, and holds a Ph.D. position at the Turku Centre for Computer Science (TUCS). He is expected to receive the Ph.D. degree in 2012. His research interests include low-power design, networks-on-chip, multiprocessor architectures, fault tolerance, thermal management, multi-processor system-on-chip, reconfigurable system design and 3D ICs. Amir has published more than 50 peer-reviewed papers in prestigious international books, journals and conferences. He is the Assistant Editor-in-Chief of the International Journal of Design, Analysis and Tools for Integrated Circuits and Systems (IJDATICS). He has served as General Chair for the PDP-DaRMuS’13 track and the DATICS-NPC’12, DATICS-BCFIC’12, DATICS-ISOCC’12 and DATICS-NESEA’12 workshops, and as Program Committee Member for several conferences, including PDP, ICESS, DATICS-IMECS, NESEA, OCPNBS, HPCS, and DRNoC.

Pasi Liljeberg received his M.Sc. and Ph.D. degrees in electronics and communication technology from the University of Turku, Turku, Finland, in 1999 and 2005, respectively. He is an Adjunct Professor in embedded computing architectures at the Embedded Computer Systems laboratory, University of Turku. During the period 2007–2009 he held a fixed-term Academy of Finland researcher position. His current research interests include adaptive energy efficient embedded systems, embedded computing platforms, agent-based system design, intelligent network-on-chip communication architectures, thermal-aware design aspects, and reconfigurable system design. He has established and is leading a research group focusing on reliable and fault tolerant self-timed communication platforms for multiprocessor systems (FastCop project, Academy of Finland).

Juha Plosila is an Associate Professor in Embedded Computing at the University of Turku, Finland. He received M.Sc. and Ph.D. degrees in Electronics and Communication Technology from the University of Turku in 1993 and 1999, respectively. He is the leader of the Embedded Computer and Electronic Systems (ECES) research unit and a co-leader of the Resilient IT Infrastructures (RITES) research program at Turku Centre for Computer Science (TUCS). He is an Associate Editor of the International Journal of Embedded and Real-Time Communication Systems (IJERTCS) published by IGI Global. His research work focuses on adaptive network-on-chip (NoC) based parallel systems at different abstraction levels, with a special focus on emerging 3D stacked multiprocessor systems.

Hannu Tenhunen received the Diplomas from Helsinki University of Technology, Finland, in 1982 and the Ph.D. from Cornell University, NY, in 1986. In 1985, he joined the Signal Processing Laboratory, Tampere University of Technology, Finland, as Associate Professor and later served as professor and department director. Since 1992, he has been a Professor at the Royal Institute of Technology (KTH), Sweden, where he also served as dean. Currently, he is the director of the Turku Centre for Computer Science, Finland, and is with the University of Turku. His current research interests are VLSI architectures and systems, especially Network-on-Chip systems. He has over 600 reviewed publications and 16 patents internationally.