mrfpga: a novel fpga architecture with memristor …cadlab.cs.ucla.edu/~cong/papers/nano11.pdfusing...

mrFPGA: A Novel FPGA Architecture with Memristor-Based

Reconfiguration

Jason Cong Bingjun Xiao

Department of Computer Science

University of California, Los Angeles

{cong, xiao}@cs.ucla.edu

Abstract— In this paper, we introduce a novel FPGA architecturewith memristor-based reconfiguration (mrFPGA). The proposedarchitecture is based on the existing CMOS-compatible memristorfabrication process. The programmable interconnects of mrFPGAuse only memristors and metal wires so that the interconnects canbe fabricated over logic blocks, resulting in significant reductionof overall area and interconnect delay but without using a 3D die-stacking process. Using memristors to build up the interconnects canalso provide capacitance shielding from unused routing paths andreduce interconnect delay further. Moreover we propose an improvedarchitecture that allows adaptive buffer insertion in interconnectsto achieve more speedup. Compared to the fixed buffer pattern inconventional FPGAs, the positions of inserted buffers in mrFPGAare optimized on demand. A complete CAD flow is provided formrFPGA, with an advanced P&R tool named mrVPR that wasdeveloped for mrFPGA. The tool can deal with the novel routingstructure of mrFPGA, the memristor shielding effect, and thealgorithm for optimal buffer insertion. We evaluate the area, perfor-mance and power consumption of mrFPGA based on the 20 largestMCNC benchmark circuits. Results show that mrFPGA achieves5.18x area savings, 2.28x speedup and 1.63x power savings. Furtherimprovement is expected with combination of 3D technologies andmrFPGA.

Keywords-FPGA; memristor; ASIC; reconfiguration;

I. INTRODUCTION

The performance/power efficiency of an application implemented in

an application specific integrated circuit (ASIC) can be as much as

six orders of magnitude higher than its counterpart coded in CPU

[1]. However the rapid increase of non-recurring engineering cost

and design cycle of an ASIC in nanometer technologies makes it

impractical to implement most applications in ASIC. This makes

the Field Programmable Gate Array (FPGA) increasingly more

popular. FPGA is a desirable type of integrated circuit that can be

reconfigured to realize a large range of arbitrary functions according

to customer demands. The programmability makes FPGA a quick and

reusable hardware implementation platform for specific applications.

Compared to ASIC, FPGA has sacrificed its performance to some

extent for programmability [2] but FPGA can still have orders of

magnitude improvement in performance/power efficiency compared

to CPU [1, 3]. The flexibility and performance of FPGA compared to

ASIC and CPU makes FPGA an important component in the realm

of customizable heterogeneous computing platforms [3].

It is well known that the programmable routing structure in FPGA

is the principal source of the performance inferiority of FPGA

when compared to ASIC [4–7]. It is reported that the programmable

interconnects in FPGA can account for up to 90% of the total area [4],

up to 80% of the total delay [5, 6] and up to 85% of the total power

consumption [7]. The study in [2] measures the gap between FPGA

and ASIC. It shows that the area, delay and power consumption of

FPGA are ∼21 times, ∼4 times and ∼12 times as much as those

of ASIC respectively. We see that if FPGA routing structure (the

dominant part in FPGA) gains some improvement, the gap between

FPGAs and ASICs will be significantly reduced.

Conventional FPGA suffers a lot from its programmable intercon-

nects partly due to extensive use of SRAM-based programming bits,

multiplexers (or pass transistors), and buffers in the interconnects.

One SRAM cell contains as many as six transistors, but can store

only one-bit data. The low density of SRAM-based storage increases

the area overhead of FPGA programmability and consequently, leads

to longer routing paths and larger interconnect delay. In addition

SRAM is a type of volatile memory – this means that it contributes

to excessive power consumption during stand-by.

Emerging technologies, especially emerging non-volatile memory

(NVM) technologies, lead to opportunities of circuit improvement.

Emerging NVMs typically have a smaller cell size than that of

SRAMs. They also have the desirable property of non-volatility,

which means that they can be turned off during stand-by to save

power. Among them, memristor, also often recognized as resistive

random-access memory (RRAM), is considered to be the most

promising one. Since HP Labs presented the first experimental

realization of memristor in 2008 [8], rapid progress on the fabrication

of high-quality memristor has been achieved in the past few years.

The current leading memristor technology for the time being has

outperformed many other emerging NVMs in some important aspects

and performs similar to them in the other aspects. It has been

demonstrated that memristor is scalable below 30nm [9], can be

fabricated with device size as small as F2 (F is the feature size) with

full compatibility of CMOS [10], and can be programmed within

5ns at 180nm technology node [11]. Due to the advantages of its

device property over SRAM and other emerging NVMs, numerous

research efforts have been made to adopt memristors for system-

level integration [12–15]. However almost all of these studies use

memristors as a new type of memory.

This paper presents a novel FPGA architecture with memristor-

based reconfiguration (mrFPGA). With the novel use of memristors

in mrFPGA interconnects rather than memory cells, the proposed

architecture reduces the gap between FPGAs and ASICs in area,

delay and power.

This paper is organized as follows. Section II introduces the

conventional FPGA architecture and review recent research work

on FPGAs with emerging technologies. Section III presents the

unique structure of memristor and its electrical characteristics and

circuit model. Section IV describes the architecture of mrFPGA

from high-level overview to detailed design. Section V provides a

complete CAD flow and evaluation method for mrFPGA. Section VI

presents detailed evaluation results of mrFPGA using the largest

twenty MCNC benchmarks. Finally we draw some conclusions in

Section VII.

1978-1-4577-0994-4/11/$26.00 c©2011 IEEE

II. BACKGROUND AND RELATED WORK

A. Conventional FPGA

Fig. 1 shows a conventional FPGA architecture. It is made up of

Fig. 1: Conventional FPGA architecture.

an array of tiles, and each tile consists of one logic block (LB), two

connection blocks (CB) and one switch block (SB). Each logic block

contains a cluster of basic logic elements (BLEs), typically look-

up tables (LUTs), to provide customizable logic functions. LBs are

connected to the routing channels through CBs and the segmented

routing channels are connected with each other through SBs. Fig. 1

also depicts two typical circuits designs of CBs and SBs based on

multiplexers (MUXs) and buffers described in [16]. The selector pins

of each multiplexer are connected to a group of 6-transistor SRAM

cells to have their connectivity determined. The circuits in Fig. 1

have copies of up to the number of pins per side of LBs in a single

CB, and up to the number of tracks per channel in a single SB.

We see that CBs and SBs make up of the interconnects of FPGAs

with much larger area and higher complexity compared to the direct

interconnects of ASIC.

B. Recent Work on FPGAs using Emerging Technologies

With the recent development of emerging technologies, a number

of novel FPGA architectures based on these technologies have been

proposed in the past few years. A 3D FPGA architecture was pro-

posed in [17]. It partitions the transistors in the routing structure and

logic structure of conventional FPGA into multiple active layers and

stacks the layers via monolithic stacking, a 3D integration technology

the author proposes to use as shown in Fig. 2a. It provides 1.7x

performance gain due to the smaller tile area and shorter interconnect

distance between tiles. However 3D stacking without proper thermal

solution, will result in the excessive increase of heat density and the

corresponding degradation of performance [19]. Also applications

of monolithic stacking are currently limited due to the reason that

high temperature required by the fabrication of transistors in an

upper layer is likely to destroy the metals wires and transistors

which have been fabricated in the layers below [18]. In [14] and

[18] two FPGA architectures are proposed using emerging NVMs

and through-silicon-via (TSV) based 3D integration. Similar to the

monolithic stacking in [17], they also move all the transistors in the

routing structure of FPGA to the top die over the logic structure

die with TSV connections between them as shown in Fig. 2b. In

addition they replace SRAM in FPGA with emerging NVMs, such

as phase-change RAM (PCRAM) [18] or memristor [14], to save the

area usage of storing one programming bit and stand-by power as

shown in Fig. 2c. 1.1x and 2x overall performance gains before and

after 3D integration respectively are reported in these two works.

SB SBCB

SB SBCB

CB CB

LB LB LB

LB LB LB

LB LB LB

Memory Layer

Switch Layer

CMOS Layer

3D-FPGA

(a) Monolithic stacking [17].

��

��

��

��

��

��

� ��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

� ��

��

��

��

��

��

��

��

��

��

��

��

��

��

� ��

� ��

��

��

��

��

��

��

��

��

(b) 3D integration with TSVs [18].

(c) Replace SRAMs with emerg-ing NVMs [14].

(d) Nanowire crossbar [19].

Fig. 2: Recent work on FPGA using emerging technologies.

However one potential problem is that the maximum density of

TSVs currently achievable might fail to satisfy the dense connections

required between the routing structure die and logic structure die.

In [19], a 3D CMOS/Nanomaterial hybrid FPGA was proposed.

Again, it separates the transistors of interconnects from those of logic

blocks and redistributes them into different dies as shown in Fig. 2d.

The difference is that it uses nanowire crossbars and face-to-face

3D integration technology to provide connections between the dies.

It provides a 2.6x performance gain compared to the conventional

FPGA architecture with certain power overhead brought by the large

capacitance of crossbar array. In addition the crossbar structure may

consist of a certain percentage of defective components [19]. To sum-

marize all, these works try to stack interconnects over logic blocks for

the sake of reduction of area and interconnect delay. However they

all suffer from a common problem. They still require transistors in

interconnects. To achieve the stacked architectures, these works have

to reorganize the interconnects and logic blocks into separate dies

and introduce 3D die-stacking (or less mature monolithic stacking).

Otherwise the high temperature during fabrication of the transistors

in an upper layer would ruin the existing transistors and metal wires

below in the same die. Die-stacking introduces undesirable problems

in terms of manufacturability, reliability and limit of the vertical

integration density. Some researchers replace the pass transistors

in the routing structure of FPGA with emerging memory elements

integrated in metal layers with back-end-of-line (BEOL) compatible

fabrication [20, 21]. But the improvement is limited due to the small

portion of pass transistors in the routing structure. There are also

studies on new reconfigurable circuit structures, such as CMOL [22]

and CNTFET-based fine-grain cell matrices [23], with the aim of

taking the place of FPGA. Lots of efforts are still needed to realize

these revolutionary ideas in products.

III. MEMRISTOR TECHNOLOGY

In this work, we propose to use memristors and metal wires but no

transistors to build up the routing structure in mrFPGA. We first intro-

2 2011 IEEE/ACM International Symposium on Nanoscale Architectures

duce the memristor technology in this section. CMOS-compatibility

is key property to integrate memristor into the mrFPGA architecture

and several works have demonstrated CMOS-compatible memristor

fabrication processes [10, 24, 25]. For example, Fig. 3 is a schematic

view and cross-sectional transmission electron microscopy (TEM)

image of memristor fabricated by a CMOS-compatible process [24]

for example. As shown in Fig. 3a memristor is a two-terminal

(a) Schematic view. (b) TEM image.

Fig. 3: Fabrication structure of memristor integrated with CMOS in

the same die [24].

resistance-like device which can be fabricated between the top metal

layer and the other metal layers on top of the transistor layer in the

same die. As explained in [24, 25], since no high temperature takes

place during their memristor fabrication processes, the processes can

be embedded after back-end-of-line process and will not ruin the

transistors and metal wires which have been fabricated below the

memristors, as shown in Fig. 3b. We also see from Fig. 3b that

the structure of memristor is quite simple, resulting in the easy

achievement of small cell size. In spite of the simple structure of

memristor, it achieves the storage of one-bit data successfully. Mem-

ristor can be programmed by applying specific programming voltage

or current at its two terminals to switch its resistance value at normal

operation between low state and high state. The resistance values at

the two states are usually more than two orders of magnitude apart

[24], or even up to six orders of magnitude apart [25]. In addition

memristor can keep the resistance value set before being powered off

without supply voltage. With this programmable property, memristor

is used as a promising type of NVM with two different states: low

resistance state (LRS) and high resistance state (HRS). In our work

the important thing is that memristor acts like a programmable switch

which can be reconfigured to determine whether its two terminals are

connected or not. If the two terminals need to be connected to each

other, we just program the memristor to LRS. Otherwise, we program

the memristor to HRS. This special mechanism is quite different from

those of conventional memories and provides a unique opportunity

to build up a new FPGA routing structure with the novel use of

memristors.

IV. mrFPGA STRUCTURE

A. Basic Structure

As discussed in the previous section, there is opportunity to build

up an FPGA routing structure based on memristors only in addition

to metal wires. Also memristor-based routing structure of FPGA

will be placed over the transistor layer in the same die, just like

the routing structure of ASICs. The basic schematic view of this

efficient FPGA architecture with memristor-based reconfiguration

(mrFPGA) is shown in Fig. 4. With the organization of logic blocks

and programmable interconnects in Fig. 4, the overall mrFPGA

architecture will become a highly compact array as shown in Fig. 5a.

From the figure we see that the area of mrFPGA will be reduced

Fig. 4: mrFPGA architecture: The programmable routing structure

is built up with metal wires and memristor which are fabricated on

top of logic blocks in the same die according to existing memristor

fabrication structures.

to the total area of the logic blocks only, which takes only 10%to 20% of the conventional FPGA area [4]. Fig. 5b is a detailed

design of the connection blocks and switch blocks in mrFPGA using

memristors and metal wires only. In this design we use five metal

layers M5 to M9 without resource conflict. The circuits of logic

blocks in mrFPGA will use the metal layers below them, i.e., M1to M4, which are sufficient for practical FPGAs, e.g. Virtex-6 [26].

The memristor layer is designed to locate between the M9 and M8

layers. We see that the positions of the layers are in the same order as

the memristor fabrication structure reported in [24] so as to guarantee

the feasibility of mrFPGA fabrication. Though memristors are located

close to the top layer in this design, electrical signals do not always

have to go through all metal layers to reach memristors, since switch

blocks are in M5 ∼ M9. The signals need to reach M1 or M2 from

the memristor layer only when they access the pins of logic blocks.

Moreover, as stated in Section I, the size of a memristor cell can

be close to or even smaller than the width of a metal wire [9, 10].

So memristors can easily fit into the design in Fig. 5b without extra

area overhead. The memristor array can be programmed efficiently

using V/2 biasing scheme and two-step write operation as proposed

in [15].

B. Interconnects with Adaptive Buffer Insertion

One potential problem of the basic mrFPGA routing structure is

that there are no buffers in the structure. It leads to a quadratic

increase of interconnect delay with the increase of the interconnect

distance. This problem would undermine the benefit brought by the

significant reduction of the overall area and interconnect delay in

the case of large FPGAs. To address this problem, we propose an

improved FPGA architecture which allows adaptive buffer insertion

in the interconnects as shown in Fig. 6. A limited number of buffers

are prefabricated in routing channels and can be connected to the

tracks in channels via memristors. Whether to insert the prefabricated

buffers or not and which track uses a buffer depend on the placement

result of the circuit to implement on mrFPGA. For a routing path

long enough to exhibit the quadratic increase of RC delay, we will

let some buffers be connected to the path at proper positions. To get

the optimal solution of insertion choice for each buffer in the routing

network, we borrow an idea from the buffer placement algorithm in

ASIC [27]. This algorithm can decide the best positions of buffers in

the interconnects of an ASIC using dynamic programming to achieve

the minimum interconnect delay. It constructs delay/capacitance pairs

which correspond to different buffer options in a routing tree of an

ASIC via a depth-first search and produces the optimal option in

time complexity of O(B2) where B is the number of legal buffer

positions. For mrFPGA applications, we let the buffer at certain

position be inserted to the routing track if the algorithm decides to

2011 IEEE/ACM International Symposium on Nanoscale Architectures 3

(a) (b)

Fig. 5: The proposed mrFPGA architecture. (a) Overview of mrFPGA where the FPGA area is contributed by logic blocks only, and (b) A

detailed design of the connection blocks and switch blocks in mrFPGA using the memristors and metal wire resources in existing memristor

fabrication structures.

Fig. 6: An improved mrFPGA architecture with adaptive buffer

insertion in the programmable interconnects.

place a buffer at this position. Then whether or not to insert one

buffer at one routing node in mrFPGA is optimized on demand. In

contrast in conventional FPGA the connections between buffers and

routing tracks are predetermined during fabrication. A case study in

Fig. 7 shows the benefit brought by the adaptive buffer insertion in

mrFPGA when compared to the fixed buffer pattern in conventional

FPGA. To simplify the problem, we assume in this case that the RC

delay of a wire with the length of one block is 0.5RwireCwire = 1ns,

and that the buffers in interconnects are considered as ideal buffers

with infinite drive, no input/output capacitance, and a fixed intrinsic

delay of 9ns. The optimal length of a wire between two adjacent

buffers would be

k =

rTbuffer

0.5RwireCwire

= 3

In conventional FPGA, the pattern of pre-fabricated buffers in inter-

connects will follow this result to distribute buffers evenly with a

distance of three blocks (as shown in Fig. 7). The problem is that

it does not know the starting point of each net (the offset can be

0, 1 or 2). So it just staggers the patterns among different tracks

Fig. 7: Comparison between fixed buffer patten in conventional FPGA

and adaptive buffer insertion in mrFPGA.

in the same channel as shown in Fig. 7. When the output pin of

a logic block happens to be connected to a wire segment that is

buffered right after the connection node, just like the block “start”

in Fig. 7 connected to track3 in the upper channel, the routing result

will deviate from the optimal. The problem can be even worse during

the switch from one track to another. A timing-driven design, such

as [28], has to always buffer the switches between tracks to avoid the

potential large RC delay caused by unbuffered connection between

the two longest unbuffered track segments, in this case as large as

36ns if the two track1s in the channels are connected without a buffer

for example. It drives the routing result farther from the optimal. The

table in Fig. 7 lists all the possible delays from the logic block “start”

to the logic block “end” according to the settings in a conventional

FPGA discussed above. The first column is the track ID in the upper

channel, and the first row is the track ID in the right channel. The

buffers at the output pin of the “start” block and the input pin of

the “end” block are not counted in the delay calculation. As shown


in Fig. 7 the worst case can be 22% slower than the best case in a

conventional FPGA. On the contrary, the adaptive buffer insertion in

mrFPGA will not suffer from the deviation from the optimal buffer

placement in interconnects as does the conventional FPGA. We see

from Fig. 7 that in mrFPGA buffers are inserted just exactly one per

three blocks, resulting in a total delay of only 45ns. It is even better

than the best case in a conventional FPGA.

V. CAD FLOW AND EVALUATION OF mrFPGA

A. CAD Flow

To implement a circuit design in FPGA, a complete CAD flow is

needed. The input of the flow is the design to be implemented and the

output of the flow includes the placement and routing (P&R) result

used for programming the design into FPGA as well as an estimation

of area, delay and power consumption for the implementation of the

design in FPGA. We propose to use a timing-driven CAD flow for our

mrFPGA as shown in Fig. 8. A circuit design first goes through the

Fig. 8: CAD flow for mrFPGA.

design tool ABC [29] to perform logic optimization and technology

mapping. A mapped netlist consisting of K-LUTs will be obtained.

Then the mapped netlist is fed into the design tool T-VPACK [30]

to pack LUTs to logic blocks and then into the design tool mrVPR

developed by us to accomplish P&R. At last we use the generated

basic-cell (BC) netlist with delay and capacitance information to run

the power estimation tool fpgaEVA LP2 [7]. Since we try to reuse

most parts of the existing CAD flow for conventional FPGAs, the

development of mrFPGA CAD flow is quite easy. Only the P&R

tool is specifically developed for mrFPGA. We will introduce the

features of this advanced tool in the last part of this section.

B. Equivalent Circuit Model for Interconnect

To perform the timing and power analysis for mrFPGA, we first

develop an equivalent circuit model for the routing structure of

mrFPGA. Fig. 9 shows the equivalent circuit of a representative

routing path from the output pin of a logic block to the input pin

of another logic block in mrFPGA and makes a comparison with

that of a conventional FPGA. In the circuit model, the MUXs in

connection blocks and switch blocks are replaced by the memristors

at low resistance state (LRS). The wire segments are still modeled

as distributed RC lines. Notice that the overall resistance value

and capacitance value of wire segments have changed due to the

reduction of the tile area in mrFPGA. Also, according to the mrFPGA

architecture in Fig. 5a, the length of a wire segment is one tile shorter

than its counterpart in a conventional FPGA. The easiest way to see

Fig. 9: Extracted equivalent circuit model of the routing structure of

mrFPGA and comparison with conventional FPGA.

it is to think about the wire segment with the length of one tile.

In mrFPGA the wire segment with the length of one tile is just the

short segment between two logic blocks, of which the length is close

to zero compared to the tile side. However the length of one tile

does not disappear. It is counted in the switch block according to the

mrFPGA architecture shown in Fig. 5b.

C. Shielding Effect of Memristor

During the development of the circuit models for the routing structure

of mrFPGA, we find out that memristor has a shielding effect to

help reduce the interconnect delay of mrFPGA further. The shielding

effect of memristors at a high resistance state (HRS) can remove the

downstream capacitance of unused paths from the total capacitance

at one node and then reduce the RC delay to reach this node. Fig. 10

shows an example of the effect in a switch block when compared

to a conventional FPGA. In the conventional FPGA, in spite of the

(a) Conventional FPGA. (b) mrFPGA.

Fig. 10: Memristors provide capacitance shielding from unused paths.

fact that the lower two paths are not used as shown in Fig. 10a, the

downstream capacitances of these two paths are still added to the total

capacitance at node A. In contrast, in mrFPGA only the downstream

capacitance of a path in use will be added to the total capacitance at

node A. The huge resistance value of memristors at HRS provides

capacitance shielding from unused paths. We use HSPICE simulation

to verify whether the resistance of memristor at HRS is high enough

to be qualified as an ideal open switch for the shielding effect. In the

mrFPGA circuit shown in Fig. 10b, we set R as 1×103Ω, and C and

Cload as 1×10−12F. For the actual memristor resistance value, we set

Rlrs as 1×103Ω according to [24, 25] and sweep Rhrs from 1×103Ω,

the same value as Rlrs, to 1× 106Ω. At this starting point of sweep,

the three paths are equivalent and the three downstream capacitances


will be all added to node A in the path from node “start” to node

“end” just like a conventional FPGA. We want to see how large Rhrs

needs to be to make the path delay from node “start” to node “end”

close enough to that in the case where Rhrs are replaced by ideal

open switches. The result obtained by the HSPICE simulations is

shown in Fig. 11. We see that with the increase of Rhrs, the delay

103

104

105

106

3

3.5

4

4.5

5

5.5

6

Rhrs

(Ω)

dela

y (n

s)

shielded by open switch

shielded by memristor

Actual Memristor Region

Fig. 11: The impact of the memristor shielding effect on the path

delay.

approaches to the value with Rhrs replaced by ideal open switches.

The resistance value of an actual memristor at HRS is more than two

orders of magnitude higher than that at LRS as reported in [24, 25].

We mark the actual memristor region according to this criterion and

see that the delay within this region is very close to the ideal situation.

This proves that the shielding effect of memristor can help reduce

path delay further in mrFPGA.

D. mrVPR: an advanced P&R Tool for mrFPGA

To deal with the novel routing structure of mrFPGA in the P&R step

of CAD flow, we develop an advanced tool named mrVPR (VPR for

memristor-based reconfiguration) on the base of the commonly-used

FPGA P&R tool VPR [30]. The main contributions of this tool are

(a) VPR for conventional FPGA (b) Our mrVPR for mrFPGA

Fig. 12: Overview of routing resource in the two tools VPR and

mrVPR.

as follows:

1) mrVPR can deal with the routing graph of mrFPGA as shown

in Fig. 12b. In Fig. 12b the gray blocks are I/O blocks and

logic blocks of FPGA and the diagonal wires are the routing

paths available in switch blocks. We see that the switch blocks

and connection blocks in mrVPR are placed over logic blocks

and provide connections for logic blocks nearby in a different

way from that of a conventional FPGA shown in Fig. 12a.

2) mrVPR can reflect the memristor shielding effect. In VPR all

the device capacitances with prefabricated connections to a

node will be added to the capacitance of that node regardless

of whether the paths which these devices belong to will be

included in the routing result or not. In mrVPR, only when

a routing path is used, the capacitances on that path will be

added to the corresponding routing nodes.

3) mrVPR is integrated with the algorithm to generate the optimal

option for adaptive buffer insertion. We have implemented the

algorithm in mrVPR and it shows where buffers are needed for

insertion in the final routing result as shown in Fig. 13. We

see that the positions for buffers to insert are marked with the

black delta shape. We also see that the interconnect view of

the routing result in Fig. 13b is quite similar to that of ASIC.

We expect to see the good performance of mrFPGA in the next

section.

(a) Routing resource view. (b) Routing result view

Fig. 13: View of the optimal option for buffer insertion in mrVPR.

VI. EXPERIMENTAL RESULTS

A. Settings

We choose Xilinx Virtex-6 FPGA platform as our conventional FPGA

model baseline in experiments and conduct evaluation of mrFPGA

at the same technology node. The architecture logic specification

of Virtex-6 needed by the CAD flow in Fig. 8 are collected from

Virtex-6 FPGA data sheets (Configurable Logic Block part) [26].

As for the architecture routing specification of Virtex-6, such as the

channel width and the distribution of wire segment lengths (Single,

Double, Quad, Long and Global), are extracted with the help of

FPGA Editor in Xilinx ISE environment. The RC delay values of

each type of wire segments are calculated from the unit resistance

and capacitance values in the lookup tables of ITRS2010 [31]. The

timing parameters of buffers inserted in mrFPGA routing structure,

including driving resistance, input capacitance and intrinsic delay, are

obtained via HSPICE simulation using PTM device models [32, 33].

The memristor model is extracted from the measurement results of

the memristors fabricated by the process in [24], the same process on

which mrFPGA is based. The 20 largest MCNC benchmark circuits

are used as the input of the CAD flow for conventional FPGA and

mrFPGA respectively and comparisons are made on the output of the

CAD flow from the aspects of area, delay and power.

B. Evaluation Results

Fig. 14 shows the comparison of the tile area in three FPGA

architectures: the baseline Virtex-6 FPGA, its mrFPGA counterpart


without buffer insertion (a fictitious case to show the impact of

buffer insertion) and mrFPGA with buffer insertion respectively. As

(a) Virtex-6 baseline. Total area = 13993μm2.

(b) mrFPGA unbuffered. (c) mrFPGA. Total area = 2547μm2.

Fig. 14: Area comparison of a single tile in three architectures (unit:

μm2). (LB: logic block; CB: connection block; SB: switch block;

Buf: pre-fabricated buffers to insert on demand in mrFPGA).

discussed in the previous sections, in the case of mrFPGA unbuffered,

the area contribution of the interconnects is reduced to zero since the

interconnects are purely made up of memristors and metal layers

located on top of logic blocks. For mrFPGA with buffer insertion,

we have to count the area of pre-fabricated buffers in the channels

according to the architecture in Fig. 6. The area contributed by buffer

insertion in mrFPGA is pessimistically calculated as the case of a

uniform distribution of pre-fabricated buffers over channels, with the

number of buffers per channel set as the maximum number of buffers

required by mrVPR to insert in one channel of mrFPGA over the 20

benchmarks. We see in Fig. 14 that the area overhead brought by

pre-fabricated buffers is low compared to the interconnect area of

conventional FPGA. That is mainly due to the on-demand property

of buffer solution in mrFPGA. Results in Fig. 14 shows more than

5.5x saving of total area for mrFPGA. Table I shows the performance

comparison between Virtex-6 baseline and mrFPGA. It shows a small

performance degradation before buffer insertion in mrFPGA and a

2.28x speedup after buffer insertion. The total speedup primarily

stems from the reduction of interconnect delay. In the routing

structure of mrFPGA, the reduction of tile area by ∼5.5x results in

a shortening of wire segments in the programmable interconnects by

∼2.35 times. Then both the resistance and capacitance values of wire

segments decrease by ∼2.35 times, contributing to the total reduction

of RC delays of the wire segments by ∼5.5 times. The shielding

effect of memristor also helps to improve performance. The quadratic

increase of the RC delay of a pure RC network cancels out the benefit

in mrFPGA without buffer insertion. We see in Table I that for some

benchmarks with long interconnect paths between logic blocks, the

pure RC delay without buffers is kind of large. However in mrFPGA

with buffer insertion the interconnects perform much better than

Virtex-6 baseline and mrFPGA unbuffered. The good performance

also stems from the adaptive buffer insertion in mrFPGA. Table II

shows the power consumption by Virtex-6 basline and mrFPGA

respectively. An average of 40% power savings is achieved. The

power savings primarily come from the replacement of transistors in

the programmable interconnects, such as MUXs and SRAMs, with

the simple structure of memristors. Both dynamic power and static

power are reduced due to less capacitance on the routing paths and

the non-volatility of memristor. Notice that no die-stacking is needed

to achieve this degree of area reduction and speedup. If 3D integration

technology is introduced in the future to stack several mrFPGAs

together, at least two more times of improvements in density and

speedup, i.e. 10x and 4.5x respectively, are both expected, according

to the experimental results on 3D architecture in [17].

VII. CONCLUSIONS

This work presents a novel FPGA architecture with memristor-based

reconfiguration named mrFPGA. The programmable interconnects of

mrFPGA use memristors and metal wires only. The routing structure

is then able to be placed over logic blocks in the same die according to

the existing memristor fabrication structure. An improved architecture

with adaptive buffer insertion is proposed to further reduce the

interconnect delay. A complete CAD flow is provided for mrFPGA

with an advanced P&R tool named mrVPR developed for mrFPGA.

The tool can deal with the novel routing structure of mrFPGA,

the memristor shielding effect, and the algorithm of optimal buffer

insertion. An evaluation of mrFPGA is done on the 20 largest MCNC

benchmark circuits. Results show that mrFPGA achieves a 5.18x area

savings, a 2.28x speedup and a 1.63x power savings.

REFERENCES

[1] P. Schaumont and I. Verbauwhede, “Domain-specific codesign for em-bedded security,” Computer, vol. 36, no. 4, pp. 68–74, Apr. 2003.

[2] I. Kuon and J. Rose, “Measuring the Gap Between FPGAs and ASICs,”IEEE Transactions on Computer-Aided Design of Integrated Circuits

and Systems, vol. 26, no. 2, pp. 203–215, Feb. 2007.

[3] J. Cong, V. Sarkar, G. Reinman, and A. Bui, “Customizable Domain-Specific Computing,” IEEE Design and Test of Computers, vol. 28, no. 2,pp. 6–15, Mar. 2011.

[4] V. George, “Low Energy Field-Programmable Gate Array,” Ph.D. dis-sertation, UC Berkeley, 2000.

[5] A. DeHon, “Reconfigurable Architectures for General-Purpose Comput-ing,” Ph.D. dissertation, MIT, Dec. 1996.

[6] E. Ahmed and J. Rose, “The Effect of LUT and ClusterSize on Deep-Submicron FPGA Performance and Density,” IEEE Transactions on

VLSI Systems, vol. 12, no. 3, pp. 288–298, Mar. 2004.

[7] F. Li et al., “Power modeling and characteristics of field programmablegate arrays,” IEEE Transactions on Computer-Aided Design of Integrated

Circuits and Systems, vol. 24, no. 11, pp. 1712–1724, Nov. 2005.

[8] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “Themissing memristor found.” Nature, vol. 453, no. 7191, pp. 80–3, May2008.

[9] Y. S. Chen et al., “Highly scalable hafnium oxide memory withimprovements of resistive distribution and read disturb immunity,” inIEDM Technical Digest, Dec. 2009, pp. 1–4.

[10] C.-h. Wang et al., “Three-Dimensional 4F2 ReRAM Cell with CMOSLogic Compatible Process,” in IEDM Technical Digest, 2010, pp. 664–667.

[11] S.-s. Sheu et al., “A 5ns Fast Write Multi-Level Non-Volatile 1 Kbits RRAM Memory with Advance Write Scheme,” in VLSI Circuits,

Symposium on, 2009, pp. 82–83.

[12] M. Mahvash and A. Parker, “A memristor SPICE model for designingmemristor circuits,” in MWSCAS, no. 2, 2010, pp. 989–992.

[13] S. Yu, J. Liang, Y. Wu, and H.-S. P. Wong, “Read/write schemes analysisfor novel complementary resistive switches in passive crossbar memoryarrays.” Nanotechnology, vol. 21, no. 46, p. 465202, Oct. 2010.

[14] M. Liu and W. Wang, “rFPGA: CMOS-nano hybrid FPGA using RRAMcomponents,” in NanoArch, Jun. 2008, pp. 93–98.

[15] C. Xu, X. Dong, N. P. Jouppi, and Y. Xie, “Design Implications ofMemristor-Based RRAM Cross-Point Structures,” in DATE, 2011.

[16] G. Lemieux and D. Lewis, “Circuit design of routing switches,” Inter-

national Symposium on FPGAs, p. 19, 2002.

[17] M. Lin, A. El Gamal, Y.-C. Lu, and S. Wong, “Performance Benefits ofMonolithically Stacked 3-D FPGA,” IEEE Transactions on Computer-

Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp.681–229, Feb. 2007.

[18] Y. Chen, J. Zhao, and Y. Xie, “3D-nonFAR: Three-Dimensional Non-Volatile FPGA ARchitecture Using Phase Change Memory,” in ISLPED,2010, p. 55.

[19] C. Dong, D. Chen, S. Haruehanroengra, and W. Wang, “3-D nFPGA:A Reconfigurable Architecture for 3-D CMOS/Nanomaterial HybridDigital Circuits,” IEEE Transactions on Circuits and Systems I: Regular

Papers, vol. 54, no. 11, pp. 2489–2501, Nov. 2007.

[20] C. Chen et al., “Efficient FPGAs using nanoelectromechanical relays,”International Symposium on FPGAs, p. 273, 2010.

[21] P.-e. Gaillardon et al., “Emerging memory technologies for reconfig-urable routing in FPGA architecture,” in ICECS. IEEE, Dec. 2010, pp.62–65.


TABLE I: Critical Path Delay and Comparison (unit: ns)Virtex-6 Baseline mrFPGA unbuffered mrFPGA

Interconnect Delay Total Delay Interconnect Delay Total Delay Interconnect Delay Total Delay

alu4 6.62 9.54 5.53 (1.19x) 8.76 (1.08x) 1.52 (4.35x) 4.75 (2.00x)

apex2 7.84 10.4 6.96 (1.12x) 10.2 (1.01x) 1.52 (5.15x) 5.12 (2.04x)

apex4 6.79 9.71 7.76 (0.87x) 10.3 (0.94x) 1.35 (4.99x) 4.58 (2.11x)

bigkey 10.9 13.3 37.7 (0.29x) 40.1 (0.33x) 3.98 (2.75x) 4.97 (2.68x)

clma 12.0 14.6 22.7 (0.53x) 24.3 (0.60x) 2.80 (4.31x) 5.27 (2.78x)

des 15.1 17.9 40.8 (0.37x) 43.6 (0.41x) 5.59 (2.71x) 8.02 (2.23x)

diffeq 4.11 5.37 2.72 (1.50x) 5.09 (1.05x) 0.34 (11.7x) 3.27 (1.63x)

dsip 9.61 10.9 27.2 (0.35x) 28.5 (0.38x) 2.52 (3.80x) 4.89 (2.22x)

elliptic 5.93 8.36 10.9 (0.54x) 13.0 (0.64x) 1.72 (3.43x) 4.09 (2.04x)

ex1010 14.2 16.8 22.1 (0.64x) 25.0 (0.67x) 3.63 (3.90x) 6.92 (2.42x)

ex5p 7.23 10.1 5.42 (1.33x) 8.34 (1.21x) 0.76 (9.43x) 4.30 (2.35x)

frisc 8.97 9.92 11.9 (0.75x) 13.2 (0.74x) 0.71 (12.5x) 4.87 (2.03x)

misex3 6.20 9.06 5.73 (1.08x) 8.65 (1.04x) 1.04 (5.91x) 4.58 (1.97x)

pdc 11.1 14.4 18.6 (0.59x) 22.3 (0.64x) 2.25 (4.93x) 6.22 (2.32x)

s298 8.39 11.1 8.29 (1.01x) 10.1 (1.09x) 0.85 (9.86x) 4.27 (2.59x)

s38417 8.39 11.1 8.29 (1.01x) 10.1 (1.09x) 0.85 (9.86x) 4.27 (2.59x)

s38584.1 8.39 11.1 8.29 (1.01x) 10.1 (1.09x) 0.85 (9.86x) 4.27 (2.59x)

seq 7.45 10.6 7.84 (0.95x) 10.3 (1.03x) 1.40 (5.30x) 4.63 (2.30x)

spla 10.8 13.8 11.9 (0.91x) 15.1 (0.91x) 1.88 (5.77x) 5.48 (2.52x)

tseng 4.90 7.27 7.62 (0.64x) 9.99 (0.72x) 0.94 (5.16x) 3.31 (2.19x)

average - - - (0.83x) - (0.83x) - (6.29x) - (2.28x)

TABLE II: Power Consumption and Comparison (unit: mW)Virtex-6 Baseline mrFPGA unbuffered mrFPGA

Interconnect Power Total Power Interconnect Power Total Power Interconnect Power Total Power

alu4 34.4 52.0 6.57 (5.24x) 24.1 (2.15x) 10.8 (3.18x) 28.4 (1.83x)

apex2 40.1 58.5 7.36 (5.44x) 25.8 (2.26x) 11.4 (3.50x) 29.9 (1.95x)

apex4 26.5 42.8 4.37 (6.07x) 20.6 (2.07x) 6.77 (3.92x) 23.0 (1.85x)

bigkey 49.6 375. 6.34 (7.83x) 332. (1.13x) 12.9 (3.82x) 338. (1.10x)

clma 75.0 139. 9.24 (8.12x) 73.2 (1.89x) 16.1 (4.65x) 80.1 (1.73x)

des 112. 558. 19.3 (5.79x) 465. (1.19x) 31.5 (3.56x) 477. (1.16x)

diffeq 15.7 32.5 2.16 (7.29x) 18.9 (1.71x) 3.81 (4.13x) 20.6 (1.57x)

dsip 79.9 409. 11.4 (6.97x) 341. (1.20x) 20.9 (3.82x) 350. (1.16x)

elliptic 52.7 157. 7.28 (7.23x) 111. (1.40x) 12.9 (4.07x) 117. (1.33x)

ex1010 61.3 109. 7.92 (7.74x) 56.3 (1.94x) 13.6 (4.48x) 62.1 (1.76x)

ex5p 20.6 34.7 3.40 (6.06x) 17.4 (1.98x) 5.47 (3.77x) 19.5 (1.77x)

frisc 28.1 60.5 2.87 (9.76x) 35.2 (1.71x) 5.25 (5.34x) 37.6 (1.60x)

misex3 33.4 50.4 6.31 (5.29x) 23.2 (2.16x) 10.0 (3.33x) 26.9 (1.86x)

pdc 57.3 97.1 8.48 (6.75x) 48.2 (2.01x) 13.8 (4.14x) 53.6 (1.81x)

s298 16.0 28.6 2.30 (6.95x) 14.9 (1.92x) 3.81 (4.20x) 16.4 (1.74x)

s38417 16.0 28.6 2.30 (6.95x) 14.9 (1.92x) 3.81 (4.20x) 16.4 (1.74x)

s38584.1 16.0 28.6 2.30 (6.95x) 14.9 (1.92x) 3.81 (4.20x) 16.4 (1.74x)

seq 37.1 55.6 6.75 (5.49x) 25.2 (2.20x) 10.9 (3.39x) 29.4 (1.88x)

spla 51.1 87.1 8.28 (6.17x) 44.2 (1.96x) 13.0 (3.90x) 49.0 (1.77x)

tseng 20.4 70.8 2.50 (8.19x) 52.8 (1.34x) 4.93 (4.14x) 55.3 (1.28x)

average - - - (6.81x) - (1.80x) - (3.99x) - (1.63x)

[22] D. B. Strukov and K. K. Likharev, “CMOL FPGA: a reconfigurablearchitecture for hybrid digital circuits with two-terminal nanodevices,”Nanotechnology, vol. 16, no. 6, pp. 888–900, Jun. 2005.

[23] P.-E. Gaillardon, F. Clermidy, I. O’Connor, and J. Liu, “Interconnectionscheme and associated mapping method of reconfigurable cell matricesbased on nanoscale devices,” in NanoArch, Jul. 2009, pp. 69–74.

[24] K. Tsunoda et al., “Low Power and High Speed Switching of Ti-dopedNiO ReRAM under the Unipolar Voltage Source of less than 3 V,” inIEDM Technical Digest, Dec. 2007, pp. 767–770.

[25] R. Huang et al., “Resistive switching of silicon-rich-oxide featuring highcompatibility with CMOS technology for 3D stackable and embeddedapplications,” Applied Physics A, pp. 927–931, Mar. 2011.

[26] Xilinx, “Virtex-6 FPGA data sheets.” [Online]. Available:http://www.xilinx.com/support/documentation/virtex-6.htm

[27] L. van Ginneken, “Buffer placement in distributed RC-tree networks forminimal Elmore delay,” in ISCAS, 1990, pp. 865–868.

[28] I. Kuon and J. Rose, “Area and delay trade-offs in the circuit andarchitecture design of FPGAs,” in International Symposium on FPGAs,

2008, p. 149.

[29] “Berkeley Logic Synthesis and Verification Group, ABC: A Systemfor Sequential Synthesis and Verification, Release 61225.” [Online].Available: http://www.eecs.berkeley.edu/∼alanmi/abc/

[30] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-

Submicron FPGAs. Norwell: MA:Kluwer, 1999.

[31] “International Technology Roadmap for Semiconduc-tors,” Tech. Rep., 2010. [Online]. Available:http://www.itrs.net/Links/2010ITRS/Home2010.htm

[32] W. Zhao and Y. Cao, “Predictive Technology Model (PTM) website.”[Online]. Available: http://ptm.asu.edu/

[33] W. Zhao and Y. Cao, “New Generation of Predictive Technology Modelfor Sub-45nm Design Exploration,” in ISQED, 2006, pp. 585–590.


mrfpga: a novel fpga architecture with memristor …cadlab.cs.ucla.edu/~cong/papers/nano11.pdfusing...

Documents