CROSSBAR ARCHITECTURES FOR VLSI SYSTEMS:
A COMPARATIVE STUDY
A Thesis
Presented in Partial Fulfillment of the Requirements for the
Degree of Master of Science
with a
Major in Electrical Engineering
in the
College of Graduate Studies
University of Idaho
by
Enrique Coen-Alfaro
April 2004
Major Professor: Gregory W. Donohoe
AUTHORIZATION TO SUBMIT
THESIS
This thesis of Enrique Coen-Alfaro, submitted for the degree of Master of Science with a
major in Electrical Engineering and titled “Crossbar Architectures for VLSI Systems: A
Comparative Study,” has been reviewed in final form. Permission, as indicated by the
signatures and dates given below, is now granted to submit final copies to the College of
Graduate Studies for approval.
Major Professor _______________________________________Date___________
Gregory W. Donohoe
Committee
Members ________________________________________Date___________
James F. Frenzel
________________________________________Date___________
Robert Rinker
Department
Chair _________________________________________Date___________
Joseph J. Feeley
Engineering
College Dean _________________________________________Date___________
David E. Thompson
Final Approval and Acceptance by the College of Graduate Studies
_________________________________________Date___________
Katherine G. Aiken
ABSTRACT
Programmable interconnect is responsible for much of the versatility in reconfigurable
hardware. Crossbars are often used to implement programmable interconnect for a variety
of applications. This study focuses on crossbar design for integrated circuits. It compares
three alternative crossbar designs: (1) a full crossbar or simple mesh, (2) a synthesized,
multiplexer-based crossbar, and (3) a Benes network. Crossbar circuits are compared in
terms of transistor count and area, throughput, power consumption, reliability, and
programming requirements. Electrical comparisons are based on transistor level simulations
of the circuits. A variety of application scenarios are discussed. For these application
scenarios, advantages and disadvantages of each crossbar architecture are presented.
Increasing crossbar size affects all the metrics for comparison. These effects of scaling are
also considered within this report. A programming algorithm is presented for the Benes
network.
TABLE OF CONTENTS
Authorization to Submit Thesis……………………………………………………………....ii
Abstract……………………………………………...……….……………………………...iii
Table of Contents…………………………………...……………………………………….iv
List of Figures………………………………………………………………………………vii
List of Tables……………………………………………………………………………..…ix
Introduction…………………………………………………………………………………..1
Chapter 1: Reconfigurable Interconnect for Application Specific Integrated Circuits…...….2
1.1 Context and motivation………………………………………………………2
1.2 Previous work on crossbar architectures……………………………………..7
1.3 Crossbar general concepts……………………………………………………9
1.4 Overview of the experiment………………………………………………...15
Chapter 2: Physical Design Considerations for Crossbars in VLSI Systems………………20
2.1 Crossbar design goals and applications………………………………….…20
2.2 Transistor sizing and delay……………………………………………...….21
2.3 Power consumption……………………………………………………..….23
2.4 Noise immunity, leakage current, and other second order effects……...….24
2.5 Reliability…………………………………………………………………..25
2.6 Programming, Routability and Pin count…………………………………..29
2.7 Ground rules for crossbar study…………………………………………....31
Chapter 3: The Simple Mesh Crossbar……………...……………………………………..40
3.1 General circuit description……………………………………………........40
3.2 Transistor count and area……………………………...……………….…..42
3.3 Delay estimations……………………………………………………….….44
3.4 Power consumption……………………….…………………………….….50
3.5 Other considerations…………………………………………………….….56
Chapter 4: Multiplexer Based Crossbar……………...…………………………..………..59
4.1 General circuit description………………………………………….…..…59
4.2 Transistor count and area……………………………………………….….63
4.3 Delay estimations……………………………………………………….….66
4.4 Power consumption…………………………………………………….…..68
4.5 Other considerations……………………………………………………......71
Chapter 5: The Benes Network……….……………...…………………………..………..74
5.1 General circuit description…………………………………………………..74
5.2 Transistor count and area…………………………………………….……...77
5.3 Delay estimations……………………………………………………………80
5.4 Power consumption……………………………………………………...….85
5.5 Other considerations…………………………………………………….......88
Chapter 6: Architecture Comparison.………………...…………………………..………..94
6.1 Area and scalability……...………………………………………………....94
6.2 Throughput…………………………………………………………….…...96
6.3 Power consumption….……………………………………………………..99
6.4 Architectural efficiency and programming complexity…………..……….100
6.5 Application scenarios...…………………………………………………....102
6.6 Conclusions and future work……………………………………………...104
References………………………………………………………………………………107
LIST OF FIGURES
Figure 1.1: Relative sizes of metal wires and transistors……...……………………………..4
Figure 1.2: Blocking and non-blocking networks……………………………………….….10
Figure 1.3: Basic crossbar……………………………………………………………….….11
Figure 1.4: Multicasting network………………………………………………………...…12
Figure 1.5: Fully connected vs. partially connected network……………………………….13
Figure 1.6: Full crossbar, mesh approach………………………………………………...…15
Figure 1.7: Automatically synthesized crossbar…………………………………………….16
Figure 1.8 (a): 4x4 Benes network. Select lines removed for clarity……………………….17
Figure 1.8 (b): Schematic diagram of a Butterfly switch………………………………..….17
Figure 2.1: Electromigration in aluminum wires……………………………………...…….26
Figure 2.2: Grain structures on VLSI metal wires…………………………………………..27
Figure 2.3: Obtaining Spice simulations from Cadence layout…………………………..…33
Figure 2.4: Circuit for measuring crossbar delay…………………………………………...34
Figure 2.5: Circuit for measuring delay across two inverters……………………………….35
Figure 3.1: 4x4 simple mesh crossbar………………………………………………………41
Figure 3.2: Layout for simple mesh crossbar……………………………………………….42
Figure 3.3: Transistor circuit for one I/O connection, using a full transmission gate……....43
Figure 3.4: Scaling trend for simple mesh crossbar………………………………………...44
Figure 3.5: Rising and falling transitions using an NMOS switch………………………….46
Figure 3.6: Average delay for pass transistor version of the simple mesh crossbar………...47
Figure 3.7: Average delay for the full pass gate version of the simple mesh crossbar……..48
Figure 3.8: Itot for mesh with NMOS switches…………...…………………………………51
Figure 3.9: Load seen by one crossbar input………………………………………………..51
Figure 3.10: Output voltage degradation for the AMI crossbar…………………………….53
Figure 3.11: Output voltage degradation for the ULP crossbar…………………………….53
Figure 3.12: Total current for mesh with full pass gates as switches…………………...….54
Figure 4.1: VHDL code for 4x4 bidirectional crossbar……...…………………………….60
Figure 4.2: Synthesized 4x4 bidirectional crossbar………………………………………...61
Figure 4.3: Four to one MUX implementation……………………………………..………62
Figure 4.4: Layout for the synthesized 4x4 crossbar……………………………………….63
Figure 4.5: Transistor circuit for a two-to-one multiplexer……………………………...…64
Figure 4.6: Scaling trend for synthesized crossbar…………………………………………66
Figure 4.7: Average delay for the synthesized crossbar…………………………………....67
Figure 4.8: Power consumption vs. frequency for the 4x4 synthesized crossbar…………..70
Figure 4.9: Power consumption vs. frequency for the 8x8 synthesized crossbar…………..70
Figure 4.10: Power consumption vs. frequency for the 16x16 synthesized crossbar………71
Figure 5.1: Behavior of the butterfly switch………………………………………………..74
Figure 5.2: 4x4 Benes network……………………………………………………………..74
Figure 5.3: 8x8 Benes network……………………………………………………………..75
Figure 5.4: 16x16 Benes network…………………………………………………………..76
Figure 5.5: Layout for the 4x4 Benes network……………………………………………..77
Figure 5.6: Butterfly switch using full pass gates…………………………………………..78
Figure 5.7: Scaling trend for Benes network……………………………………………….79
Figure 5.8: Average delay for the pass transistor version of the Benes crossbar…………...82
Figure 5.9: Average delay for the full pass gate version of the Benes crossbar……………82
Figure 5.10: Power consumption vs. frequency for the 4x4 Benes crossbar using NMOS...87
Figure 5.11: Algorithm for configuring Benes networks…………………………………...90
Figure 5.12: Benes crossbar and configuration matrix…………………………………..….91
Figure 6.1: Transistor count vs. Crossbar size……………………………………………….95
Figure 6.2: Layout area vs. Crossbar size…………………………………………………...95
Figure 6.3: Signal delay using AMI transistor models……………………………………...97
Figure 6.4: Signal delay for crossbars using ULP transistor models………………………..98
Figure 6.5: Worst case wire length………………………………………………………….98
Figure 6.6: Static power consumption for crossbars using ULP transistor models………..100
Figure 6.7: Architectural efficiency for different crossbars……………………………….101
LIST OF TABLES
Table 2.1: Delay across two inverters driving a 0.1 pF capacitance…………………...…...35
Table 3.1: Transistor Count and Area…………………………………………………….…43
Table 3.2: Delay summary for simple mesh crossbar using NMOS transistors as switches..45
Table 3.3: Delay summary for simple mesh crossbar using full pass gates as switches…....45
Table 3.4: Average parasitic capacitance for simple mesh crossbar………………………..49
Table 3.5: Worst case wire lengths………………………………………………………….50
Table 3.6: Power consumption summary for simple mesh using NMOS switches………....55
Table 3.7: Power consumption summary for simple mesh using full pass gate switches…..55
Table 3.8: Architectural efficiency for the simple mesh……………………………………57
Table 4.1: Transistor count and layout area……………………………………………...…64
Table 4.2: Delay summary for multiplexer-based crossbar………………………………...67
Table 4.3: Worst case wire lengths…………………………………………………………68
Table 4.4: Power consumption summary for synthesized crossbar………………………...69
Table 4.5: Architectural efficiency for synthesized crossbar……………………………….72
Table 5.1: Transistor Count and Area………………………………………………………78
Table 5.2: Delay summary for Benes network using NMOS transistors as switches………80
Table 5.3: Delay summary for Benes network using full pass gates as switches………..…80
Table 5.4: Average parasitic capacitances for Benes crossbar……………………………..84
Table 5.5: Worst case wire lengths…………………………………………………………85
Table 5.6: Power consumption summary for Benes networks using NMOS transistors…...85
Table 5.7: Power consumption summary for Benes crossbar using full pass gates………..86
Table 5.8: Minimum frequency for ULP power savings…………………………………..88
Table 5.9: Architectural efficiency for the Benes crossbar………………………………...88
INTRODUCTION
Reconfigurable hardware provides a low cost solution for Application Specific Integrated
Circuits (ASICs). Such hardware owes much of its versatility to programmable
interconnect. Crossbars are often used to implement this programmable interconnect. This
study focuses on crossbar design for very large scale integrated (VLSI) circuits. It compares
three crossbar designs: (1) a full crossbar or simple mesh, (2) a synthesized, multiplexer-
based crossbar, and (3) a Benes network. The comparisons are based on transistor count,
area, throughput, power consumption, reliability, architectural efficiency, and programming
requirements.
Varying circuit size has a significant impact on overall crossbar performance, as defined
by the metrics for comparison listed above. All crossbar designs were laid out for three
different sizes: a 4x4 crossbar, an 8x8 crossbar, and a 16x16 crossbar. Scaling trends for
various crossbar parameters were analyzed. Advantages and disadvantages of each crossbar
architecture are presented, for a variety of application scenarios.
Electrical comparisons are based on transistor level simulations of the circuits.
Transistor level netlists include parasitic capacitances extracted from silicon layout.
Synopsys® and Cadence® design tools were used to generate the layout. Transistor level
simulations were done using Smartspice®.
Chapter 1 frames the problem of reconfigurable interconnect in VLSI systems, and
defines the terminology associated with crossbars. Chapter 2 summarizes the physics
necessary to understand the different application scenarios. It also describes the simulation
procedure, and the ground rules for valid comparisons across architectures. Chapter 3
describes the simple mesh crossbar. Chapter 4 presents the multiplexer-based crossbar.
Chapter 5 is devoted to Benes networks. Chapter 6 compares the three architectures,
providing conclusions and recommendations for future work.
CHAPTER 1: RECONFIGURABLE INTERCONNECT FOR APPLICATION
SPECIFIC INTEGRATED CIRCUITS
1.1 Context and motivation
The relentless proliferation of integrated circuit applications has created a vast market for
application specific integrated circuits (ASICs). Low volume fabrication runs
remain the most expensive step in ASIC design. Therefore, reconfigurable computing has
emerged as the most cost-effective solution for many applications. Moreover, the possibility
of reconfiguring a device to perform a variety of tasks reduces the amount of costly and time
consuming circuit design. Likewise, in-system reconfiguration enables the same hardware
to be used for a variety of different problems, achieving some of the flexibility of software
with the performance of dedicated hardware.
The onset of this reconfigurable computing trend dates back to the mid 1980s, when field
programmable gate arrays (FPGAs) were introduced. At the time, their main application
was rapid prototyping of “glue logic”, and, eventually, rapid prototyping of full digital
systems. Growing densities and improved CAD tools eventually made FPGAs very
attractive for applications like hardware acceleration, logic emulation and custom
(reconfigurable) computing, among others [2]. Beyond FPGAs, the next generation in
reconfigurable computing includes field programmable processor arrays (FPPAs). The
difference between FPGAs and FPPAs is their level of granularity. While FPGAs allow
reconfiguration of the system at the gate level, FPPAs provide higher level functional
blocks, such as adders and multipliers, which can be interconnected to perform a variety of
tasks.
Whether it be FPGAs, complex programmable logic devices (CPLDs) or field
programmable processor arrays (FPPAs), the versatility of reconfigurable computing is
highly dependent on programmable interconnect. In fact, interconnect takes up to 90%
of the area of an FPGA [8]. In order to improve performance and reduce power
consumption, programmable devices are built on a hierarchical architecture, with multiple
partitions sharing central crossbars [16,21,22].
Although the literature on FPGAs usually describes the high-level architectures in detail,
there is very little information on the circuit-level or the physical design of the devices,
because much of the information is proprietary [6]. Considering that reconfigurable devices
may be used in such a wide range of applications, it is an interesting problem to determine
whether certain circuit implementations are better suited to achieve specific design
objectives, such as high speed, low power consumption, or radiation tolerance.
Several controversies surround the design of reconfigurable interconnect for very large
scale integration (VLSI) systems. There is an ongoing debate regarding the degree to which
interconnect related issues will dominate integrated circuit (IC) design in the future.
Considering that metal conductor geometries have not shrunk at the same rate as transistor
geometries, many would agree that scaling into deep submicron technologies has shifted the
paradigm from device-dominated to interconnect-dominated design methodology. The
impact of metal wires on current VLSI systems is most obvious in terms of circuit size, as
figure 1.1 illustrates.
Figure 1.1: Relative sizes of metal wires and transistors.
Transistors used to make up about 30% of total circuit volume, with interconnect and
field oxide accounting for the remaining 70%. In 2004, transistors take up less than 2% of
the total circuit volume.
As technology continues to scale towards higher densities, the area of the logic will
decrease, but the relative effect of the delays due to wiring will increase [5]. According to
Cheng et al. [4],
“Increasing circuit sizes on the one hand, and shrinking design rules on the other, have
shifted ICs and MCMs from being device dominated to being interconnect dominated. In
deep submicron technology, wire delay can no longer be ignored, and must be factored
prominently into any overall performance metric.”
In contrast to such dire warnings about the relevance of interconnect in IC performance,
current CAD tools and wire models do not provide accurate assessments of the effects of
wiring on circuit speed and power consumption. The available models seem particularly
simplistic when attempting to describe heavily wired circuits, such as crossbars and
multistage interconnection networks, which are at the core of any reconfigurable computing
solution. Apparently, current CAD tools have not yet caught up with the demands of deep
submicron design. Yet, most experts agree that there is a trend. As feature sizes scale more
aggressively in the silicon than in the metal, the role of interconnect in critical design criteria
becomes more significant. Moreover, clock rates have increased far beyond 100 MHz,
where simple resistance-capacitance (RC) wire models do not capture the transmission line
behavior of metal lines. Although there is controversy about the magnitude of this impact,
there is consensus that technology is evolving towards a scenario where interconnect
dynamics will play a larger role in IC performance. Even if we have not reached the point
where interconnect-related effects severely limit circuit throughput or power consumption
for most applications, it is important to note that reconfigurable computing relies on
interconnect intensive circuitry. Thus, it has become relevant to look more closely at the
issue of programmable interconnect performance, and its effect on reconfigurable VLSI
systems.
The second controversy surrounding reconfigurable interconnect deals with the optimal
routing architecture for a given application. In this document, routing architecture
refers to the organization of wire segments and switches in a reconfigurable network.
Routing architecture has a strong effect on the speed, cost and routability of any system. In
fact, for most reconfigurable computing applications, routing architecture is the key
determining factor of system speed and logic density, because programmable switches have
significant resistance and capacitance. These switches also require additional area not used
in fixed interconnection systems [5,6,15].
The increasing effect of delays due to wiring mentioned above will create further
demands on the routing architecture. In particular, the amount of logic available on a single
reconfigurable chip will also increase, resulting in larger systems being built [5,6].
Architecturally, there will always be a need for innovation in logic and routing structures,
and thus it becomes relevant to ask which architecture of reconfigurable interconnect
provides the best solution for a particular application.
Many routing architecture alternatives have been studied in a variety of contexts. The
problem of reconfigurable interconnect is akin to the problem of switching networks for
telephone systems. In both cases, the goal is to provide a pathway that connects point A to
point B. Furthermore, as in a telephone system, it is desirable for a VLSI interconnect block
to guarantee that the establishment of a particular connection does not prevent other
connections between previously unconnected points. As a result, several routing
architectures for VLSI applications are based on telephone switching network schemes, such
as those proposed by Clos and Benes [1,7].
Although research has been done to assess the performance of different routing
architectures for telecommunications, studies dealing specifically with hardware crossbar
networks for VLSI applications are scarce. Permutation networks using a variety of
structures have been studied in theory quite extensively. Although delay boundaries in
terms of number of switches and path length have been established, there seems to be no
study based on transistor level simulation of the devices. Such a simulation might yield
important information about the behavior of the actual physical circuits. The absence of
circuit level simulations may be due to the lack of appropriate wire models mentioned
above. Notice that assuming wire delays to be zero, which is a very common default value
in most CAD tools, completely neglects any effect that physical routing might have on
circuit performance.
1.2 Previous work on crossbar architectures
Gaudet et al. [11] perform a comparison of crossbar architectures similar to the ones
presented in this study. However, their comparisons focus on the application of crossbars to
the design of iterative decoders. The present document explores the viability of several
crossbar architectures in the more general realm of reconfigurable computing. Furthermore,
[11] provides coarse capacitance estimations based on
transistor count, but there are no dynamic timing simulations to validate the theoretical
calculations. In contrast, the present report includes delay estimations based on actual
layout of the crossbars, for two different fabrication processes. Circuit level simulations are
provided to illustrate the dynamic timing results. The simulation data should help expand
the observations put forth in more theoretical approaches. By bringing the theoretical and
empirical aspects together, the VLSI engineer will gain insight into the most appropriate
interconnection scheme for a particular application.
The published literature regarding reconfigurable interconnect provides a wealth of
theory, but it is lacking in practical comparisons. Empirical studies that evaluate the
implementation of real circuits on different architectures would provide a clearer picture of
the advantages of each architecture relative to others [11]. Thorough empirical studies are
limited by cost, and observability of the phenomena being studied. In the context of
reconfigurable interconnect, simulations are limited by wire models. In particular, situations
like the high leakage phenomena associated with ultra low power CMOS technologies may
be inadequately modeled in commercially available CAD tools. As the relevance of
interconnect analysis grows, CAD developers will need to provide better wire models.
In order to offer a statistically meaningful performance assessment for any reconfigurable
computing architecture, it is particularly important that a good set of benchmark circuits be
available. With the trend toward larger systems-on-a-chip, true comparative studies can
only have merit when they use circuits that reflect a wide span of scenarios. Although a few
studies provide speed comparisons based on benchmarks at the board level, no set of
benchmarks has been established to gauge the performance of reconfigurable interconnect
within a single chip [5,15]. Furthermore, the development of reliable benchmarks for
reconfigurable interconnect architectures falls outside the scope of this project.
Consequently, the simulations performed here serve only to illustrate circuit behavior under
specific conditions. The simulation results presented in this document suggest a behavioral
pattern, which, coupled with theoretical considerations, leads to the architectural
recommendations issued in this study.
However, it must be noted that the data gathered are insufficient for a proper statistical
analysis. In spite of the numerical data and calculations presented, the results of this study
should be regarded as qualitative, rather than quantitative.
As mentioned above, the goal is to identify which interconnect structures are best suited
for specific environments and applications. In particular, three alternatives for building
crossbars are compared: a standard crosspoint matrix approach, a synthesizable
combinational logic approach, and a butterfly switch approach, based on Benes networks
[1]. This report points out advantages and disadvantages of each architecture, and ultimately
makes recommendations as to which architectures should be considered for design in
specific situations.
Among the metrics for comparison, aspects such as area, speed, power consumption,
flexibility, fabrication complexity and programming complexity have been evaluated. The
comparison also takes into account the effects of varying the number of nodes that the
crossbar must connect, as well as performance variations across different fabrication
processes.
Our observations emphasize circuit level simulation results, incorporating as much real
layout information as possible. The purpose of such emphasis is to gain insight into
electrical aspects of reconfigurable interconnect design that have not been compared in
previous similar studies. Despite the relative inaccuracy of wire models in traditional CAD
tools, studying dynamic timing behavior illuminates circuit characteristics that may be
crucial in certain environments. Moreover, the comparisons have shed light on certain
specific shortcomings of current wire models, which are also discussed.
The case of Benes networks is particularly interesting. DeHon first explored the
theoretical advantages of programmable interconnect using non-blocking Benes networks
[8]. Additionally, [2] and [11] present routing architectures based on Benes networks for IC
applications. However, none of these references provides a VLSI implementation of their
networks. This thesis takes the next step in asserting the
feasibility of Benes networks in VLSI systems. First, it provides a transistor level circuit
design for non-blocking Benes networks, complete with silicon layout. It then incorporates
the layout information into dynamic simulation results through capacitance extraction.
Beyond the realm of physical circuit realization, any new design must provide the
appropriate implementation tools to configure the circuit [5,6]. For the proposed Benes
network, the activation and deactivation of the butterfly switches to achieve specific
interconnection patterns is not a trivial problem. Thus, an algorithm for programming the
butterfly switch network is included.
1.3 Crossbar general concepts
The purpose of this section is to introduce the basic terminology used to describe the
different interconnect structures being compared. The concepts presented here are intended
only as an introduction. Additional terms will be defined in the sections where they become
relevant.
In a blocking interconnection network, establishing certain connections between inputs
and outputs prevents some inputs from being connected to other outputs. In non-blocking
structures, once any specific connection has been established, there is still at least one path
that leads from any unconnected input to any unconnected output. In figure 1.2, each black
bubble represents a valid connection port. The solid lines represent connections that have
been made. The dashed lines represent possible alternate connections. If we restrict each
connection port to provide one and only one input/output (I/O) connection, the top network
is non-blocking. Meanwhile, in the array at the bottom, establishing a connection between
In0 and Out1 prevents the possibility of connecting In1 to Out0. The bottom diagram
illustrates a blocking network.
Figure 1.2: Blocking and non-blocking networks.
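The distinction can be illustrated with a minimal Python sketch. It assumes a hypothetical model in which each input/output pair has a list of candidate paths, and each path is a set of named internal links; the link names and both 2x2 topologies below are invented for illustration and are not taken from figure 1.2. A connection is blocked when every candidate path shares a link with a connection already in place.

```python
def can_connect(paths, in_use, src, dst):
    """Return a free path (a set of link names) from src to dst, or None if blocked."""
    for path in paths.get((src, dst), []):
        if not (path & in_use):  # no link of this path is already occupied
            return path
    return None


# Hypothetical non-blocking 2x2 network: every pair owns a dedicated link.
non_blocking = {
    ("In0", "Out0"): [frozenset({"x00"})],
    ("In0", "Out1"): [frozenset({"x01"})],
    ("In1", "Out0"): [frozenset({"x10"})],
    ("In1", "Out1"): [frozenset({"x11"})],
}

# Hypothetical blocking 2x2 network: both diagonal connections share one link.
blocking = {
    ("In0", "Out0"): [frozenset({"a"})],
    ("In0", "Out1"): [frozenset({"shared"})],
    ("In1", "Out0"): [frozenset({"shared"})],
    ("In1", "Out1"): [frozenset({"b"})],
}


def diagonal_pair_fits(paths):
    """Connect In0->Out1 first, then report whether In1->Out0 is still possible."""
    in_use = set()
    first = can_connect(paths, in_use, "In0", "Out1")
    in_use |= first
    return can_connect(paths, in_use, "In1", "Out0") is not None


print(diagonal_pair_fits(non_blocking))
print(diagonal_pair_fits(blocking))
```

In the second network, connecting In0 to Out1 occupies the only link that In1 could use to reach Out0, which mirrors the blocking case described above.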
A crossbar is a network topology in which any input port can be connected to any free
output port without blocking [12]. A basic crossbar, or crosspoint, switch has a set of
inputs, a corresponding set of outputs, and a set of addresses mapping inputs to outputs, as
shown in figure 1.3. The size of the crossbar refers to the number of points being connected.
It is typically specified as a number of inputs and a number of outputs. Thus, a 5x3 crossbar
is a crossbar in which five inputs can be connected to three outputs.
Figure 1.3: Basic crossbar
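As a behavioral illustration only, the sketch below models a basic crossbar as a set of addresses that map each output to the input driving it. The class name `Crossbar` and its methods are hypothetical and do not correspond to any circuit or code in this study. The model refuses to drive an output that is already in use, and it leaves unconnected outputs undriven.

```python
class Crossbar:
    """Behavioral model of an n_inputs x n_outputs crossbar (illustrative only)."""

    def __init__(self, n_inputs, n_outputs):
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs
        self.address = {}  # output index -> input index

    def connect(self, inp, out):
        """Connect input `inp` to output `out` if the output is still free."""
        if not (0 <= inp < self.n_inputs and 0 <= out < self.n_outputs):
            raise IndexError("port out of range")
        if out in self.address:
            raise ValueError(f"output {out} is already driven")
        self.address[out] = inp

    def route(self, data):
        """Return the value seen at each output; None marks an undriven output."""
        return [
            data[self.address[out]] if out in self.address else None
            for out in range(self.n_outputs)
        ]


# A 5x3 crossbar: five inputs can be connected to three outputs.
xbar = Crossbar(5, 3)
xbar.connect(4, 0)
xbar.connect(1, 2)
print(xbar.route(["a", "b", "c", "d", "e"]))
```

Because the address table is keyed by output, this particular model permits one input to drive several outputs; restricting it to one-to-one operation would require an additional check on the input side.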
According to Han et al. [12], crossbar networks can be categorized into three major
topological classes: full-crossbar networks, multistage interconnection networks (MINs),
and networks consisting of multiple levels of full crossbar connections, called hierarchical
crossbar interconnection networks (HCINs). The implementations presented in this study
belong to the first two categories. A full-crossbar network is comprised of a single
switching element. In contrast, multistage interconnection networks (MINs) consist of
multiple interconnected layers of simpler switches.
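To make the contrast concrete, the following sketch compares the number of crosspoints in a full crossbar with the number of 2x2 switches in a Benes network, which is one example of a MIN. It relies on the standard result that an N x N Benes network, with N a power of two, has 2·log2(N) − 1 stages of N/2 switches each. The function names are hypothetical, and the counts refer to switching elements, not to transistors or layout area.

```python
def benes_stages(n):
    """Stages in an n x n Benes network of 2x2 switches (n a power of two, n >= 2)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    return 2 * (n.bit_length() - 1) - 1  # 2*log2(n) - 1


def benes_switches(n):
    """Total number of 2x2 switches: n/2 switches in every stage."""
    return benes_stages(n) * (n // 2)


def full_crossbar_crosspoints(n):
    """A full n x n crossbar needs one crosspoint per input/output pair."""
    return n * n


# The three sizes considered in this study.
for n in (4, 8, 16):
    print(n, full_crossbar_crosspoints(n), benes_stages(n), benes_switches(n))
```

The crosspoint count grows quadratically, while the Benes switch count grows as N·log2(N). Each 2x2 switch contains several transistors, however, so this count alone does not settle the area comparison.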
A multistage switching network is said to be rearrangeable if, for every matching of
input to output pins, there exists a switch setting sequence such that the matching can be
realized. A set of input to output matchings that must be achieved simultaneously
constitutes a permutation of the MIN. Utilization of the network is defined as the ratio of
the number of programmable switches used to implement a specific permutation over the
total number of switches in the network [2].
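A short sketch of the utilization calculation follows. It assumes a full mesh crossbar in which each input/output connection closes exactly one crosspoint switch; the function names and the list encoding of a permutation (entry i holds the output assigned to input i) are illustrative choices, not notation from [2].

```python
def is_permutation(mapping, n):
    """True if `mapping` assigns each of the n outputs to exactly one input."""
    return sorted(mapping) == list(range(n))


def mesh_utilization(mapping):
    """Utilization of an n x n mesh for a one-to-one permutation.

    Assumes one crosspoint switch is used per connection.
    """
    n = len(mapping)
    if not is_permutation(mapping, n):
        raise ValueError("mapping is not a one-to-one permutation")
    used = {(inp, out) for inp, out in enumerate(mapping)}
    return len(used) / (n * n)


# Input 0 -> output 2, input 1 -> output 0, input 2 -> output 3, input 3 -> output 1.
print(mesh_utilization([2, 0, 3, 1]))
```

Under this assumption a full permutation on an n x n mesh uses n of its n² switches, so utilization falls as 1/n when the crossbar grows.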
In terms of flexibility, crossbars may be classified according to the following criteria:
• Bi-directional or unidirectional: The purpose of any crossbar is to connect two points,
A and B, for instance. In a bi-directional crossbar, data can flow freely from A to B, or
from B to A. In unidirectional crossbars, the set of inputs and the set of outputs are
clearly differentiated. Data may flow from an input to an output, but not in the opposite
direction.
• One-to-one or multicasting: Multicasting crossbars allow for one input to be broadcast
simultaneously through more than one output. One-to-one crossbars require that each
input is connected to one and only one output at any given time.
Figure 1.4: Multicasting network.
• Fully connected or partially connected: Fully connected crossbars allow any input to
be connected to any output. In contrast, a partially connected crossbar is such that at
least one input cannot reach one or more outputs. In other words, the idea of full
connectivity implies that all one-to-one mappings of the input pins to the output pins are
possible [2]. In figure 1.5 the dashed lines represent all valid connections. Notice that,
in the bottom diagram, In1 is isolated from Out0, making this a partially connected
crossbar.
Figure 1.5: Fully connected vs. partially connected network.
• Static or dynamic configuration: The notion of static and dynamic configuration of
crossbars relates more to the application, than to the nature of the crossbars themselves.
Static configuration refers to the case in which the crossbar connections are set at
configuration time and then stay fixed while the system is running. Dynamic crossbars
allow the possibility of reprogramming connections during run time. That is, a given set
of connections may be reconfigured even as data run through another portion of the
same crossbar.
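Two of these criteria can be checked mechanically, as the sketch below suggests. It assumes a hypothetical encoding in which `allowed` is the set of (input, output) pairs that the hardware can connect, and `config` maps each driven output to its input; neither name comes from the cited literature.

```python
def is_fully_connected(allowed, n_in, n_out):
    """True if every input can reach every output."""
    return all((i, o) in allowed for i in range(n_in) for o in range(n_out))


def is_multicast(config):
    """True if some input drives more than one output in this configuration."""
    drivers = list(config.values())
    return len(drivers) != len(set(drivers))


# 2x2 example: removing the In1 -> Out0 pair gives a partially connected crossbar.
full = {(i, o) for i in range(2) for o in range(2)}
partial = full - {(1, 0)}

print(is_fully_connected(full, 2, 2), is_fully_connected(partial, 2, 2))
print(is_multicast({0: 0, 1: 0}), is_multicast({0: 1, 1: 0}))
```

Connectivity is a property of the hardware, whereas multicasting is a property of a particular configuration, which is why the two checks take different arguments.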
In the available literature, the architectures of interconnection networks have been
classified into two generations, also known as direct and indirect routing [10, 17]. In the
first generation architecture, or direct routing, processing elements are directly connected to
each other in a mesh pattern or some other fixed pattern, and it is the processing element
itself, i.e., an FPGA, that may be reconfigured. In other words, the interconnect within an
FPGA may be reconfigured, but when building a network of FPGAs, the connection
between different FPGAs is fixed. This mesh architecture relies on the locality of
circuit designs: a node in a circuit is more likely to be connected to its neighboring
nodes than to distant ones. The second generation of routing architectures, also known as indirect
routing, is the partial crossbar interconnection network. With indirect routing, it is possible
to reconfigure the connections between processing elements, in addition to configuring the
processing elements themselves. Commercial examples of second generation routing are
field programmable interconnect devices (FPIDs), which are used to connect several FPGAs
at the board level. The evolution of systems from first generation routing to second
generation routing provides evidence of the increasing role of programmable interconnect in
the ASIC market.
We will use the term architectural efficiency, defined in [2], to encompass the number of
programmable switches, number of I/O pins, average utilization of programmable resources,
and delay encountered in a routing path. As a figure of merit useful in comparing routing
architectures, architectural efficiency will be computed using
equation 1.1.
$$\text{Architectural efficiency} = \frac{\text{Number of inputs} \times \text{Utilization}}{\text{Number of switches} \times \text{Delay elements in path}} \qquad (1.1)$$
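As a rough illustration, the figure of merit can be evaluated with a few lines of Python. The sketch reads equation 1.1 as (number of inputs × utilization) divided by (number of switches × delay elements in path). The function name and the example values are assumptions made here: a 4x4 mesh with one pass gate per crosspoint, one switch per path, and a utilization of 1.0 chosen purely as a placeholder.

```python
def architectural_efficiency(num_inputs, utilization, num_switches, delay_elements):
    """Figure of merit, read here as
    (inputs * utilization) / (switches * delay elements in path)."""
    if num_switches <= 0 or delay_elements <= 0:
        raise ValueError("switch count and delay elements must be positive")
    return (num_inputs * utilization) / (num_switches * delay_elements)


# Placeholder values for a 4x4 mesh: 16 crosspoint switches, 1 switch per path.
# The utilization of 1.0 is an assumption, not a measured result.
mesh_4x4 = architectural_efficiency(
    num_inputs=4, utilization=1.0, num_switches=16, delay_elements=1
)
print(mesh_4x4)
```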
1.4 Overview of the experiment
The study presented here compares three different crossbar architectures. One of the
architectures is a full crossbar, while the other two represent multistage interconnection
networks. The full crossbar is essentially a mesh of metal lines where each input/output
connection is provided by a pass gate, as shown in figure 1.6. To take advantage of the
layout regularity, this mesh was custom designed and laid out, without the aid of any
hardware description languages or synthesis tools.
Figure 1.6: Full crossbar, mesh approach.
The second architecture uses multiplexers, tristate buffers, and logic gates to select the
input/output connection. This architecture is the result of CAD synthesis from a VHDL
behavioral description of the crossbar. Figure 1.7 illustrates a 4x4 crossbar synthesized
using this method.
Figure 1.7: Automatically synthesized crossbar.
The final architecture is based on the standard Benes network described in [1] and [2].
This approach allows for certain flexibility, as it provides several choices concerning the
length of the routing paths or the number of switches. This is also the only one of the three
designs in which scalability and programming are not straightforward. As an example, the
4x4 network is shown in figure 1.8(a). The blocks labeled as B are butterfly switches,
whose schematic diagram is shown in figure 1.8(b). Each butterfly switch consists of four
transmission gates. These transmission gates work in pairs. When control signal S is a zero,
the top and bottom transmission gates are on, while the transmission gates on the left and
right are off. Hence, when S is zero, In0 is connected to Out0, and In1 is connected to Out1.
When S switches to one, the top and bottom gates turn off, and the left and right gates turn
on. At this point, In0 connects to Out1, and In1 connects to Out0. The layout for all Benes
networks in this study was drawn manually, based on transistor level schematic diagrams.
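The behavior of the butterfly switch described above can be captured in a small behavioral model. This is only a logic-level sketch (it says nothing about the electrical behavior of the transmission gates), and the function names are chosen here for illustration.

```python
def gate_states(s):
    """Which transmission gates conduct for a given select value S."""
    if s not in (0, 1):
        raise ValueError("S must be 0 or 1")
    return {"top": s == 0, "bottom": s == 0, "left": s == 1, "right": s == 1}


def butterfly(in0, in1, s):
    """Return (Out0, Out1): straight through when S = 0, crossed when S = 1."""
    gates = gate_states(s)
    if gates["top"] and gates["bottom"]:
        return in0, in1  # In0 -> Out0, In1 -> Out1
    return in1, in0      # In0 -> Out1, In1 -> Out0


print(butterfly("a", "b", 0), butterfly("a", "b", 1))
```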
Figure 1.8 (a): 4x4 Benes network. Select lines removed for clarity.
Figure 1.8(b): Schematic diagram of a Butterfly switch.
Each architecture was used to build crossbars of three different sizes: 4x4, 8x8, and
16x16. Among the aspects to be considered in comparing these circuits were throughput,
scalability, design complexity and time to design, programming requirements, power
consumption, transistor count, and area penalty. All circuit design and simulations were
completed through the Cadence design flow, provided by their IC front to back tools,
available at the University of Idaho's Electrical and Computer Engineering department.
Circuit level simulations were performed using Smartspice®.
All of the circuits presented above fit the definition of a crossbar, as they are non-
blocking interconnect structures. Furthermore, all circuits are designed to be fully
connected and bidirectional, when used for one to one connections in a static configuration
mode. Note that, in this case, multicasting implies a sense of direction, as it does not make
sense to drive one output through more than one permanently connected input. In the case
of dynamic configuration, it is possible to retain bidirectionality even when multicasting
signals. It should be remarked that although the Benes network is non-blocking for a one to
one network, multicasting may in fact block other inputs from reaching their outputs. These
properties will be addressed in the following chapters. Unless otherwise noted, the networks
will be assumed to be operating in a one to one, statically configured mode, which makes
them fully connected, bidirectional crossbars.
A total of nine circuits were designed, and each of them has been simulated for two
fabrication processes with very different characteristics. One is a standard, speed driven
fabrication technology (AMI 0.6 microns). The other process uses the same layers in
drawing the layout, but parasitic model parameters have been skewed to emulate the
characteristics of an ultra low power, high leakage environment, operating on a supply
voltage of just 0.5 volts. High leakage is a major concern for processes aiming for low
power consumption through the use of low voltage supplies. Thus, one of the goals of
this project is to determine which crossbar network would perform best in a high leakage
environment.
Capacitance parameters for each of these circuits were extracted from the layout. These
capacitance parameters were then included in the circuit level simulations of the circuits.
Layout data were also used to estimate circuit area. Many SPICE simulations were
performed on each of the circuits. The simulations presented in this report were chosen to
portray specific situations relevant to throughput or power consumption.
Along with these electrical considerations, the issue of programming interconnect is
worthy of special attention. A configuration algorithm is presented for one of the
programmable interconnect structures, the Benes based approach. Key concerns in
interconnect controller design are addressed, although no transistor level design of these
controllers has been performed. The assessment of this configuration time control is based
on the published literature and the simulation of the interconnect structures themselves.
CHAPTER 2: PHYSICAL DESIGN CONSIDERATIONS FOR CROSSBARS IN
VLSI SYSTEMS
2.1 Crossbar design goals and applications
This document attempts to determine if certain crossbar architectures are better suited for
specific applications than others. To understand the potential advantages of a particular
crossbar in a given environment, it is necessary to be acquainted with the physical principles
that affect crossbar performance in that environment. This chapter summarizes some key
physical considerations that pertain to interconnect design, and, in particular, to crossbar
design. Once this background has been established, section 2.7 outlines the experimental
procedure used to compare crossbar architectures.
Historically, the goal of field programmable gate arrays (FPGAs) has been to produce a
high speed architecture and circuit with a reasonable logic density [5]. Power consumption
and tight reliability constraints have not been the driving force in the FPGA market. The
demand for application specific integrated circuits (ASICs) keeps growing, and
reconfigurable devices provide a less expensive alternative to custom fabrication runs.
As a result, it has become relevant to provide reconfigurable computing for applications
whose aim is different from that of traditional FPGAs. Of course everyone wants
applications to run as fast as possible, but for portable systems it is also desirable that
batteries last longer, and that the system is smaller. Even more critically, in life safety
applications, like fire alarms, and certain medical equipment such as cardiovascular
pacemakers, battery life and reliability are more pressing design concerns than a faster
clock frequency. Many of the services available today, such as weather reports, GPS
systems, and even the Internet, rely heavily on satellites. Electronics for outer space must
deliver the best possible performance, on a limited power budget, in a radiation intensive
environment. Accordingly, the effectiveness of programmable interconnect structures
should be examined under a wider range of environmental conditions, with a broader
spectrum of applications in mind.
The evaluation of crossbar performance for a variety of applications requires an
understanding of the underlying physical principles that govern circuit design in a particular
context. Standard issues that must be considered include throughput, power consumption,
circuit size, noise immunity and reliability. What constitutes acceptable performance in
each of these categories is strongly dependent on the application for which the circuit is
intended. This chapter provides an overview of possible application scenarios, and the
physical challenges for reconfigurable interconnect in each of these environments. First of
all, it must be noted that this study focuses on datapath interconnect, under the assumption
that data skew can be tolerated. As a result, this document does not include an in-depth
analysis of signal skew for the routing alternatives presented. Reconfigurable interconnect
structures such as the ones presented in this study are not suitable for clock routing, as skew
between alternative paths may be significant.
2.2 Transistor sizing and delay
Transistor sizing is a major concern for achieving high performance. Small transistors
limit the current that a given circuit can handle, while large transistors have a higher
capacitive load. Tools such as logical effort theory [20] have been developed to assist in
transistor sizing and buffer insertion for optimal performance of CMOS circuits. However,
logical effort theory focuses on speed driven designs, rather than our context based
definition of high performance.
As a general trend, fabrication process technology seeks to shrink transistor sizes in order
to increase logic density. Having more transistors provides increased computational power
in every integrated circuit. Concurrently, many process engineers seek to develop faster
transistors, which may function in systems with higher clock frequencies, providing higher
throughput. The quest for smaller, faster devices has led to what is known as deep
submicron technologies, which are currently the state of the art in terms of logic density and
speed.
Notice that the goal of increasing logic density implies fitting more transistors into a
given die. Although transistor sizes have shrunk significantly, die sizes for most
applications do not scale as aggressively. Because signals must travel across the die, smaller
transistors do not necessarily translate into shorter wires. On the other hand, connecting to
smaller transistors often requires thinner wires. As a result, wire resistance tends to increase
as fabrication processes shrink. Capacitance remains fairly constant as feature sizes scale.
Thus, an increase in resistance causes an increase in interconnect delay. In contrast, gate
delay decreases as processes shrink. Overall, the balance in delay contribution is shifting
from gates to interconnect.
Certain fabrication process adjustments help mitigate this increase in wire delay. The
most common of these process variations include reducing both oxide permittivity and metal
resistance. In fact, the impact of metal resistance in this context has emerged as a source of
controversy. Several experts believe that aluminum has run its course, and fabrication
processes must switch to lower resistance metals such as copper. Others defend the position
that the situation is not so dramatic yet, and a premature switch to copper would simply be
too expensive. In any case, when considering deep submicron fabrication processes,
interconnect-intensive circuits, such as crossbars, owe most of their signal delay to the metal
layers, rather than the transistors. Typically, smaller and faster transistors allow significant
increases in clock frequency, which translate into higher throughput. However, for the latest
fabrication processes, we might be reaching a point where interconnect delay, rather than
transistor switching, is the limiting factor for maximum speed of operation.
2.3 Power consumption
Although increasing clock frequency has the desired effect of providing higher
throughput, a higher switching rate actually worsens power consumption and degrades
noise immunity. Recall the general expression for dynamic power consumption in
CMOS circuits,
$$P = C \cdot f \cdot V_{dd}^{2} \cdot u \qquad (2.1)$$
where P is the dynamic power consumption of the circuit, C is the load capacitance, f is the
operating frequency, Vdd is the supply voltage, and u is a utilization factor for the circuit
output. From equation (2.1), it follows that when CMOS circuits operate at higher
frequency they consume more dynamic power.
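To make the dependence on frequency and supply voltage concrete, equation (2.1) can be evaluated directly. The sketch below uses the two supply voltages considered in this study (3.3 V and 0.5 V); the load capacitance, frequency, and utilization factor are placeholder values chosen only for illustration, and holding them equal for both supplies is itself an assumption.

```python
def dynamic_power(c_load, freq, vdd, utilization):
    """Dynamic power in watts: P = C * f * Vdd^2 * u."""
    return c_load * freq * vdd ** 2 * utilization


# Placeholder operating point: 0.1 pF load, 100 MHz, utilization 0.5.
C_LOAD = 0.1e-12
FREQ = 100e6
UTIL = 0.5

p_3v3 = dynamic_power(C_LOAD, FREQ, 3.3, UTIL)
p_0v5 = dynamic_power(C_LOAD, FREQ, 0.5, UTIL)

# With C, f, and u held equal, the ratio depends only on (3.3 / 0.5) squared.
ratio = p_3v3 / p_0v5
print(p_3v3, p_0v5, ratio)
```

Because power scales with the square of the supply voltage, lowering Vdd is a far more effective lever than lowering frequency by the same factor.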
Reducing power consumption becomes important for portable system designs. In a
world full of mobile telephones, personal digital assistants, laptop computers, and other
battery powered applications, power budget is a crucial design specification. In particular,
space-borne applications pose the tightest constraints in terms of power budget. Even in
more conventional, non-portable systems, power dissipation remains an important concern
in circuit design. For instance, personal desktop computers that are plugged into an alternating
current (AC) outlet must limit their power consumption to prevent overheating of the
integrated circuits. Traditionally, gates have been the main sources of on-chip power
dissipation. However, just as it has happened with delay, the situation has changed. As
interconnect scales down, the spacing between wires becomes smaller. Thus, coupling
capacitances replace parallel-plate and fringing capacitances as the dominant source for
overall wire capacitance. Today, interconnect capacitance is the cause of most of the on-
chip power dissipation, especially for crossbars, which are inherently wiring intensive
circuits [4].
2.4 Noise immunity, leakage current, and other second order effects
One common way of reducing power dissipation has been to reduce the supply voltage,
from 5 V a few years ago, to 3.3 V, then 2.5 V and less. Recent work in ultra-low-power
CMOS has produced devices operating with a supply voltage of 0.5V [3]. As the supply
voltage is reduced, the MOS transistor thresholds are also reduced. This causes an increase
in sub-threshold leakage current, which increases static power consumption and degrades
switching performance, producing slower transitions. The
sharpness of a signal's rising and falling edges strongly affects the short circuit current.
Additionally, coupling noise may change the shape of the signal transition waveform in a
way that causes a higher short circuit current, further complicating the problem [4]. The
effects of leakage are most pronounced in circuits with large numbers of parallel transistors,
such as memories and crossbars [14]. In designing crossbars for low power applications, it
becomes relevant to consider the effects of leakage on throughput, power consumption, and
noise immunity.
At the other end of the spectrum, for extremely fast systems with decreasing rise times,
inductance becomes an important issue in signal integrity and delay. Inductance causes
ringing in the signal waveform, which can affect signal integrity if it exceeds allowed
thresholds. Also, with increasing wire lengths, the effective current loop is increased,
making mutual inductances a nonnegligible effect, which may further increase the short
circuit current.
Beyond an increase in power consumption, degraded signal waveforms cause other, more
dramatic, problems. Specifically, when signal integrity becomes compromised, noise
immunity is harder to achieve. For instance, in deep submicron technologies, interconnect
capacitance becomes comparable to or even larger than gate capacitance. Thus, for dynamic
logic, the charge of interconnect may overwrite the content of the gate capacitance [3]. As
distance between wires scales down, crosstalk and charge sharing effects become more
significant. Furthermore, as supply voltages decrease, so does the threshold voltage. A
lower threshold voltage causes an increase in leakage current, which is most hazardous for
highly parallel structures, such as memories and crossbars [14]. These second order effects
call for more sophisticated circuit models, and capacitance extraction techniques.
Otherwise, Computer Aided Design (CAD) verification tools become a less reliable
predictor of actual circuit operation.
Whether one is pushing the limits of signal throughput, or circuit power consumption, it
seems that wire models currently available neglect important circuit dynamics. As an
example, for very fast circuits, the lumped-circuit representation is no longer adequate, since
it results in substantial underestimation of both crosstalk and delay. Hence, it would be
appropriate to model wires as transmission lines, rather than simple RC networks [4].
However, developing software that takes into account these considerations is costly, and
simulation times would probably increase beyond acceptable ranges. Because CAD
simulation results rely on such limited models, the designer must pay careful attention to
environmental conditions that may exacerbate these second order effects.
2.5 Reliability
Computer simulations cannot account for aging of the components, or statistical
variations within a fabrication process. Reliability analysis studies trends in integrated
circuit fabrication, and provides guidelines to ensure systems will work properly, for the
longest time possible, in spite of the aging of components. Reliability is also concerned with
the fact that no two integrated circuits are exactly alike, and yet, it must be possible to
replace a defective integrated circuit with an equivalent spare, and expect the system to work
correctly. In this context, reliability assessments depend upon the environment where the
integrated circuit must operate. For example, a high density of power dissipation creates hot
spots that cause temperature increases. If the temperature rises beyond the operating range,
the chip may fail to function [4].
There are two major sources of failure that affect interconnect: (1) electromigration, and
(2) formation of stress-induced voids, also called stress voiding. Electromigration occurs
when current is forced through metal lines whose width is not constant. Electrons tend to
accelerate when they reach the narrower sections of the wire, as shown in figure 2.1. As
electrons accelerate, they gain momentum. Eventually, the moving electrons collide with
the metal nuclei. The collision causes momentum to be transferred from the electrons to the
nuclei. Although the metallic nuclei are much heavier than the electrons, continued
collisions cause displacement of the metallic nuclei. Over time, these displaced nuclei can
lead to opens in a metal line. In fact, they may even lead to shorts, if the metal is displaced
in such a way that it provides a conduction path between lines that should be isolated from
each other.
Electromigration is primarily affected by operating temperature and current density in the
metal line. Higher current densities mean that more electrons are moving through the wire.
No fabrication process can guarantee that the width of its metal lines will be constant
throughout a chip. Thus, high current density is a major cause of electromigration. As
interconnect shrinks, the current density increases, making electromigration a growing
concern.
Figure 2.1: Electromigration in aluminum wires.
It must be noted that although electromigration is a serious problem for aluminum wires, copper
wires are far less susceptible to this phenomenon. Copper has a lower resistivity than aluminum,
and copper nuclei are much heavier than aluminum nuclei. The replacement of aluminum by
copper would greatly reduce the electromigration problem, but the switch would represent a huge
investment for foundries. Instead, designers have chosen to prevent electromigration by
limiting the maximum current density that may run through an aluminum wire. As a rule of
thumb, current density is restricted to less than 1 mA/µm². To ensure the designed
crossbars do not suffer electromigration, current density has been calculated and monitored
throughout this project.
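A check of this kind can be sketched as follows. The 1 mA/µm² limit is the rule of thumb quoted above; the wire width, thickness, and current in the example are hypothetical values, not measurements from the crossbars in this study.

```python
LIMIT_MA_PER_UM2 = 1.0  # rule-of-thumb limit quoted in the text


def current_density(current_ma, width_um, thickness_um):
    """Current density in mA/um^2 through a wire of rectangular cross section."""
    if width_um <= 0 or thickness_um <= 0:
        raise ValueError("wire dimensions must be positive")
    return current_ma / (width_um * thickness_um)


def within_limit(current_ma, width_um, thickness_um, limit=LIMIT_MA_PER_UM2):
    """True when the wire stays below the electromigration rule of thumb."""
    return current_density(current_ma, width_um, thickness_um) < limit


# Hypothetical wire: 0.9 um wide, 0.6 um thick.
print(within_limit(0.5, 0.9, 0.6), within_limit(1.0, 0.9, 0.6))
```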
At microscopic levels, a single aluminum wire is made up of several grains, which may
be arranged in a variety of structures, as shown in figure 2.2.
Figure 2.2: Grain structures on VLSI metal wires.
Grain structure strongly affects the likelihood of metal wire failure due to stress voiding.
A triple point is a location along a metal wire where three grain boundaries intersect. Figure
2.2 shows that a bamboo grain structure has no triple points along the
wire. Stress voiding usually occurs at triple points. Therefore, an aluminum wire with a
bamboo grain structure is less prone to stress voiding failure than a polygranular wire.
Essentially, stress voiding results from a stress gradient acting on a triple point. After
annealing, different parts of a chip cool down at different rates. As materials cool down,
they contract. Different cool down rates result in different contraction rates. Consequently,
one side of a wire will feel more stress from the contracting materials than the other side. In
the presence of such a stress gradient, a wire will tend to break where the atomic bonds are
weakest, leading to open wire failures. Metal wires are weakest at triple points.
Electromigration aids the stress voiding process by making atomic bonds even weaker.
Unlike electromigration, techniques for dealing with stress voiding are mainly a fabrication
process issue, rather than a circuit design issue. One might argue that longer wires that span
a die from side to side are more likely candidates for stress voiding, because materials will
be more heterogeneous at both ends of the wire. Furthermore, a smaller layout area would
allow more room for redundant circuitry which
would increase the overall reliability of the system. As a result, at the layout level, attempts
have been made to keep metal wires as short as possible, and total layout area to a minimum.
It must be pointed out that stress voiding depends on several factors, and equating the
likelihood of stress voiding to wire length would be a gross oversimplification. A statistical
analysis of stress voiding for the crossbar architectures and fabrication processes falls
outside the scope of this project.
One of the most hostile environments for integrated circuits (ICs) is outer space. For
space borne applications, ICs must be able to deliver optimal performance in a radiation
intensive environment. Two approaches have been used to get around this issue. Some
vendors provide radiation hardened packaging for commercial off-the-shelf integrated
circuits. On the other hand, researchers around the world develop rad-hard-by-design
systems. Ultimately, these radiation tolerant designs are also encapsulated in rad-hard
packaging [19]. In addition to radiation tolerance, space grade ICs must comply with a
limited power budget. Furthermore, it is very expensive to replace a defective system in
space, so circuits must be designed to be both highly reliable and fault tolerant. The various
space programs around the world are major consumers of ASICs, as the applications they
develop are frequently cutting edge technology. Reconfigurable systems provide the
versatility that such a dynamic field requires, at a fraction of the cost of custom fabrication
runs. On account of this close match between the needs of the space application market and
the advantages of reconfigurable computing, it becomes worthwhile to evaluate how
different crossbar topologies perform in a radiation intensive environment. In rad-hard by
design circuits, the need for guardbands and redundancy results in area overhead, which may
be a valid reason for choosing the smallest crossbar implementation for this kind of
application. Also, rad-hard ICs often include built in test structures, used to measure
performance of the chip when exposed to controlled amounts of radiation, which further
increase the area penalty. In summary, smaller designs allow more room for guardbands,
radiation hardening circuitry, and redundancy. If reliability is not the driving concern in the
design, a smaller area translates either into lower cost, or increased computational power, as
there is more room for additional transistors.
2.6 Programming, routability, and pin count
Regardless of the type of programmable switch used, the capacitance, resistance, and size
of programmable connections make them much slower and larger than a simple metal wire
[5]. Besides all the electrical constraints, there are mechanical limitations that a crossbar
designer must consider. Note that crossbars may be designed to interconnect entire busses,
rather than individual bit lines. In such a case, the size of a routed crossbar increases
roughly by a factor of W, where W is the number of bits in the bus. Size concerns
notwithstanding, the major challenge for a system with wide busses is routability. The space
available for routing is limited by two factors: layout design rules and the number of metal
layers available in the fabrication process. Using a higher number of metal layers may
reduce the footprint of the layout, but it offers no help regarding the routing path length or
the cost of the design. Moreover, routing wires through vias may even result in longer
delays for some signals [2].
The problem of running thousands of parallel wires through the patterns demanded by the
different crossbar architectures is not trivial. CAD tools provide routing algorithms, but
these often take hours or even days of computation time for highly complex systems. More
importantly, it is not uncommon for automatic routing to yield less than optimal results. In
consequence, for large crossbar designs, it may be favorable to use the crossbar architecture
which provides the simplest, and most regular, routing pattern. Paraphrasing Chow et al. in
[6], layout styles should favor regularity, with the irregular structures restricted to the final
control logic.
While routing is a concern for VLSI wiring in general, reconfigurable interconnect must
also budget for the programming control logic. Programming logic not only influences size
and routability, but it also affects how much time it takes to configure the crossbar. Ideally,
one would prefer to load all the crossbar addresses in parallel, so that the entire crossbar is
configured in a single load operation. However, a real integrated circuit has a limited
number of pins available for interface to the outside world, so it is far more common to load
configurations in several steps. In other words, there is a pin count versus speed of
programming compromise. If a particular routing architecture uses fewer pins to configure
its connections, it may require more input/output (I/O) transfers to complete its
programming, thus increasing configuration time. Now, given a fixed I/O interface, a
crossbar architecture that needs fewer bits to program its connections, will take less time to
configure.
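The pin count versus programming speed compromise reduces to a ceiling division: the number of load operations is the configuration size divided by the number of bits that can be loaded at once, rounded up. The following sketch assumes that every configuration pin carries one bit per transfer; the bit and pin counts in the example are hypothetical.

```python
def config_transfers(config_bits, config_pins):
    """Number of I/O transfers needed to load config_bits over config_pins,
    assuming each pin carries one bit per transfer."""
    if config_bits < 0 or config_pins <= 0:
        raise ValueError("need non-negative bits and at least one pin")
    return -(-config_bits // config_pins)  # ceiling division


# Hypothetical interface with 8 configuration pins.
for bits in (64, 56, 20):
    print(bits, config_transfers(bits, 8))
```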
Typically, multistage interconnection networks (MINs) can be configured using fewer
bits than standard mesh crossbars. The price to be paid for such convenience lies in the
programming complexity of the crossbar architecture. In the mesh architecture, it takes only
one switch to build a path from a crossbar input to a crossbar output. Despite using fewer
address bits, programming a MIN is not as simple. In MINs, it takes several stages to link
inputs and outputs. Furthermore, using a particular path disables certain possible routes
between some inputs and outputs. As a result, it becomes necessary to perform additional
computations at configuration time. These computations must yield a valid switch
combination that realizes all the intended connections. Variations on the routing algorithms
used in computer networks often provide efficient ways to complete this task. And yet, for
certain systems, the additional “crossbar programming” step may be unacceptable,
especially for larger crossbars with too many alternative paths. In addition to all the
electrical and mechanical considerations, a practical design must consider the compromise
between routing flexibility and programming effort. On the one hand, it is relevant to
establish how the number of address bits scales with growing crossbars. On the other hand,
one must evaluate how much the programming complexity grows for more convoluted
crossbar architectures.
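How the number of configuration bits scales can be estimated with a short sketch. Several assumptions are made here and are not taken from the designs in this study: the mesh is programmed either with one bit per crosspoint or with one encoded log2(N)-bit address per output, and the Benes network is the standard topology with 2·log2(N) − 1 stages of N/2 butterfly switches, each controlled by a single select bit.

```python
def _log2_exact(n):
    """log2 of n, requiring n to be a power of two (n >= 2)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    return n.bit_length() - 1


def mesh_bits_per_crosspoint(n):
    """Assumption: one configuration bit for each of the n*n crosspoints."""
    _log2_exact(n)
    return n * n


def mesh_bits_encoded(n):
    """Assumption: one log2(n)-bit input address stored per output."""
    return n * _log2_exact(n)


def benes_bits(n):
    """Assumption: standard Benes network, one select bit per butterfly switch."""
    stages = 2 * _log2_exact(n) - 1
    return (n // 2) * stages


for n in (4, 8, 16):
    print(n, mesh_bits_per_crosspoint(n), mesh_bits_encoded(n), benes_bits(n))
```

Under these assumptions the Benes network needs fewer bits than either mesh encoding at the three sizes considered, but the count says nothing about the extra computation needed to find a valid switch setting.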
2.7 Ground rules for crossbar study
In order to provide a meaningful comparison of the crossbar alternatives, it is necessary
to establish a common context in which to work. The ideal would be to provide identical
conditions for the three architectures, so that any trends observed can be attributed to
specific crossbar circuits and their inherent properties. Wherever possible, care has been
taken to ensure an identical environment, but, due to fundamental differences in the
architectures themselves, certain compromises have been made. This
section describes the simulation environment, and the general procedure used in obtaining
data for all the circuits being compared.
Fifteen different layouts were generated using the AMI 0.6 micron technology definitions
available for Cadence. Each of the three approaches was laid out in three different sizes:
4x4, 8x8, and 16x16 crossbars. Additionally, the full crossbar approach and the Benes
interconnection network were laid out in two versions for each size. One version used full
pass gates as switches, while the other one used NMOS transistors as switches.
Capacitances were extracted for each of these nine layouts. A basic SPICE netlist
describing the transistors and parasitic capacitances was obtained for each layout.
Unfortunately, the technology files assumed metal layer resistances to be zero, and thus
automated resistance extraction was impossible. Alternatively, wire lengths and widths
were measured, and resistances were estimated for each input/output path. Although these
values are analyzed theoretically in discussions about crossbar delay for each architecture,
the resistance values were not incorporated into circuit level simulations. For larger feature
sizes, where λ is greater than 0.3 microns, the observations presented here are valid, because
wire resistance is indeed negligible. However, as processes scale into deep submicron
technologies, better circuit models would be necessary to provide meaningful comparisons.
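A manual resistance estimate of the kind described above is commonly made by counting squares of metal: the resistance of a segment is the sheet resistance multiplied by its length-to-width ratio. The sketch below assumes this standard approach; the sheet resistance and segment dimensions are placeholders, not values from the AMI technology files or from the layouts in this study.

```python
def wire_resistance(sheet_res_ohm_per_sq, length_um, width_um):
    """Resistance of one wire segment: R = R_sheet * (L / W)."""
    if width_um <= 0:
        raise ValueError("width must be positive")
    return sheet_res_ohm_per_sq * (length_um / width_um)


def path_resistance(segments, sheet_res_ohm_per_sq):
    """Sum the resistance of (length_um, width_um) segments along one path."""
    return sum(
        wire_resistance(sheet_res_ohm_per_sq, length, width)
        for length, width in segments
    )


# Placeholder values: 0.05 ohm/square metal, two segments on one I/O path.
SHEET_RES = 0.05
path = [(900.0, 0.9), (450.0, 0.9)]
print(path_resistance(path, SHEET_RES))
```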
When first laid out, all transistors in the netlist were associated with AMI 0.6 micron
transistor models, as provided by MOSIS. Spice transistor models for the ultra low power
(ULP) fabrication process were obtained through the Center for Advanced Microelectronics
and Biomolecular Research (CAMBR) at the University of Idaho. The ULP transistor
model was inserted into the nine original netlists, providing nine additional circuits for
comparison. In all, eighteen circuits were simulated for various input/output (I/O) patterns.
The simulation results were used to compute average signal delays, total leakage current,
and power consumption. Figure 2.3 summarizes the procedure for obtaining SPICE
simulation results based on circuit layouts.
Figure 2.3: Obtaining Spice simulations from Cadence layout.
It must be stressed that this is not a true ULP layout, but an AMI netlist using ULP
transistor models. Consequently, simulation results provided here may differ from those
obtained for a circuit specifically laid out for ULP. In particular, parasitic capacitances
associated with ULP may be very different than the values extracted for AMI. The point of
using ULP models in this study is to examine crossbar performance in systems where
leakage currents are not negligible. Transistor models with very low threshold voltages,
such as those obtained for ULP, mimic the leakage conditions being targeted.
Closely related to the issue of measuring delay is the question of how to drive and load
the simulation test circuits. In [19], Sutherland provides several recommendations and
examples aimed at generating realistic edges for transistor level simulations. Following
these recommendations, all simulation inputs are driven using two cascaded inverters, whose
input is connected to the direct current (DC) or pulse sources available in Smartspice®.
Similarly, the outputs are loaded with three cascaded inverters, followed by a 0.1 picofarad
capacitor. The output is then measured after the second output inverter. The test circuitry
for one input/output pair is depicted in figure 2.4. Delay is measured from point A to point
B in the figure. For this project, delay is defined at 50% of the full voltage swing. Since the
AMI process uses a 3.3 volt supply, delay was measured between the time the input reached
1.65 volts, and the time the output reached that same voltage. For the ULP process the
switch point was 0.25 volts, which is half of the 0.5 volt supply.
Figure 2.4: Circuit for measuring crossbar delay
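The 50% delay definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the waveforms are synthetic linear ramps rather than Smartspice output, and the helper names (crossing_time, delay_50, ramp) are hypothetical. The crossing instant is found by linear interpolation between samples taken at the 0.1 ns time step used in this study.

```python
def crossing_time(times, volts, level):
    """Return the first time at which the sampled waveform reaches `level`.

    Linear interpolation is used between adjacent samples.
    """
    for i in range(1, len(times)):
        v0, v1 = volts[i - 1], volts[i]
        if v0 == level:
            return times[i - 1]
        if (v0 - level) * (v1 - level) <= 0:
            frac = (level - v0) / (v1 - v0)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    raise ValueError("waveform never reaches the requested level")


def delay_50(times, v_in, v_out, vdd):
    """Delay between the 50% crossings of the input and output waveforms."""
    level = vdd / 2.0
    return crossing_time(times, v_out, level) - crossing_time(times, v_in, level)


def ramp(t, start, stop, vdd):
    """Hypothetical test waveform: linear 0-to-vdd ramp between start and stop."""
    return vdd * min(1.0, max(0.0, (t - start) / (stop - start)))


# Synthetic example (not thesis data): 0.1 ns time step, 3.3 V supply.
times = [0.1 * i for i in range(41)]
v_in = [ramp(t, 1.0, 2.0, 3.3) for t in times]
v_out = [ramp(t, 2.0, 3.0, 3.3) for t in times]
print("50% delay in ns:", delay_50(times, v_in, v_out, 3.3))
```

For the ULP models the same function would be called with vdd = 0.5, which places the measurement level at 0.25 volts.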
Note that such measured delay is the sum of the crossbar delay plus the delay across two
inverters. To obtain the net crossbar delay, one must subtract the delay across two inverters.
The test circuit illustrated in figure 2.5 was simulated to determine the delay across two
cascaded inverters driving a third inverter of the same size. The load seen at point D in
figure 2.5 is the same as the load seen at point B in figure 2.4. Therefore, the delay between
points C and D is the value that must be subtracted from the original data. Table 2.1 shows
the inverter delay values for the two fabrication processes used.
Figure 2.5: Circuit for measuring delay across two inverters.
Table 2.1: Delay across two inverters driving a 0.1 pF capacitance.
Process    Rising transition    Falling transition    Average across two inverters
AMI 0.6    0.285 ns             0.285 ns              0.285 ns
ULP        0.896 ns             0.786 ns              0.841 ns
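As a small illustration of the correction described above, the following Python sketch subtracts the inverter delays of Table 2.1 from a measured delay. It assumes the subtraction is applied per transition type; the measured value of 1.5 ns is a hypothetical number used only for the example, and the function names are not part of the original tool flow.

```python
# Delay across two inverters, in nanoseconds, taken from Table 2.1.
INVERTER_PAIR_DELAY_NS = {
    "AMI": {"rising": 0.285, "falling": 0.285},
    "ULP": {"rising": 0.896, "falling": 0.786},
}


def net_crossbar_delay(measured_ns, process, transition):
    """Remove the two-inverter delay from a delay measured between A and B."""
    return measured_ns - INVERTER_PAIR_DELAY_NS[process][transition]


def average_inverter_delay(process):
    """Average of the rising and falling inverter delays for one process."""
    d = INVERTER_PAIR_DELAY_NS[process]
    return (d["rising"] + d["falling"]) / 2.0


# Hypothetical measurement of 1.5 ns on a rising transition, AMI models.
print("net delay in ns:", net_crossbar_delay(1.5, "AMI", "rising"))
print("ULP average inverter delay in ns:", average_inverter_delay("ULP"))
```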
From Table 2.1 it is evident that AMI transistors are faster than their ULP counterparts.
This behavior is consistent with the fact that AMI processes operate on a 3.3 volt power
supply, while ULP operates on 0.5 volts. The difference in rising and falling transition
delays for the ULP process is a consequence of the netlist generation procedure described
above. The layouts were built using AMI 0.6 standard cells and AMI 0.6 layout design
rules. Transistors in the layout were sized for equal rise and fall times on the AMI 0.6
fabrication process, as evidenced in the first row of Table 2.1. When the netlists were
modified to utilize ULP transistor models, transistor sizes were not modified. Consequently,
rise and fall times are slightly different on ULP simulation runs. For the remainder of this
document, the word “delay” refers to the average of rising transition delay and falling
transition delay, unless otherwise specified.
All Spice simulations in this project ran for one microsecond, with a 0.1 nanosecond time
step. All simulation runs used to measure delays were such that they included at least three
falling transitions and three rising transitions for each I/O pair. In every case, the delay for
each transition was measured, and then two averages were computed, one for rising
transitions and one for falling transitions. Additionally, all delays were averaged to get a
delay estimate for a given I/O pair in a specific simulation run. Thus, for each I/O pair
being connected, a particular simulation yields an average delay, an average rising transition
delay, and an average falling transition delay. All of the
averages computed in this manner are then compared to get a worst case delay for the
crossbar. Furthermore, the delays are averaged again across all I/O pairs, to get an average
delay for the crossbar implementation.
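The averaging procedure just described can be summarized with a short Python sketch. The delay values below are hypothetical placeholders, not simulation results, and the function names are illustrative. The worst case is taken here as the largest of all the per-pair averages, which is one reading of the comparison described above.

```python
from statistics import mean


def summarize_pair(rising_ns, falling_ns):
    """Average rising, falling, and overall delay for one I/O pair."""
    return {
        "rising": mean(rising_ns),
        "falling": mean(falling_ns),
        "average": mean(list(rising_ns) + list(falling_ns)),
    }


def summarize_crossbar(pairs):
    """Return per-pair summaries, worst case delay, and average delay."""
    summaries = {name: summarize_pair(r, f) for name, (r, f) in pairs.items()}
    worst = max(v for s in summaries.values() for v in s.values())
    average = mean(s["average"] for s in summaries.values())
    return summaries, worst, average


# Hypothetical delays in ns: three rising and three falling transitions per pair.
pairs = {
    "In0->Out0": ([1.0, 1.2, 1.1], [1.4, 1.5, 1.3]),
    "In1->Out1": ([2.0, 2.0, 2.0], [3.0, 3.0, 3.0]),
}
summaries, worst, average = summarize_crossbar(pairs)
print(summaries)
print("worst case:", worst, "average:", average)
```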
Dynamic power consumption generally depends on how often signals toggle between
voltage rails. Such signal activity is specific to the operations being performed and the data
being processed. Because no thorough set of benchmarks has been defined to assess
crossbar power consumption, the estimates presented in this report should not be regarded as
absolutes. Within the context of this project, care has been taken to ensure circuits operated
at the same frequency and with the same I/O activity. In other words, all circuits being
compared have been subjected to the same testbenches.
Power consumption was gauged by monitoring source current, at each input, on every
simulation run. Two values were monitored for each current sample: Ileak and Ipeak. The
direct current (DC) component of the current waveforms was isolated. In most cases, this
value represents the leakage current for a particular input. Leakage across all inputs was
added for every simulation run, and is referred to as Ileak for the remainder of this document.
Ileak was averaged across simulation runs to generate an Ileak value for each circuit. Average
static power consumption was computed by multiplying Ileak by the supply voltage for each
circuit. The absolute value of the largest current peak for each circuit was labeled as Ipeak for
that circuit. The value of Ipeak is directly proportional to dynamic power consumption in the
circuit. Thus, comparing Ipeak for different circuits provides insight into dynamic power
consumption of the circuit.
Although all outputs have identical loads, the capacitance seen from a particular input to
the crossbar depends on whether the switch is on or off. If the switch is off, the contribution
to the capacitance is due to the switch transistor itself, and in the case of ultra low power
processes, this load may be significant [14]. If the switch is on, then all of the circuitry on
the other side of the switch will contribute to the load seen by the driving signal. This
means that, when multicasting inputs, the loading will be a function of the number of
crossbar switches that are turned on.
A related problem may arise after chip power-up, but before the
crossbar programming has been downloaded. At this point, the configuration bits have not
been programmed, and two or more inputs may try to drive the same
output. Reference [6] presents two alternatives to deal with this problem. The first possibility is to
disable the output drivers, using a tristate buffer, during power up and programming. This
design impacts the speed of the crossbar, because the disabling transistors of the tristate
buffer are in the critical path. The second alternative involves adding a global control
signal, which is activated during power up and disables all crossbar connections [6].
Next to interconnect, the most important element in a crossbar is the programmable
switch. The design of the switches will have a significant impact on the area and speed of
the array. FPGAs and other reconfigurable computing applications have used several
switching methodologies, which include Static Random Access Memory (SRAM) controlled
pass transistors and pass gates, antifuses, and even floating-gate-based switching [6]. For
this study, due to the lack of proper electrical models for simulation, only two alternatives
were considered: full transmission gates, or single n-channel pass transistors as switches.
The most obvious advantage of using a single n-channel pass transistor is a decrease in
area. The area benefit comes not only from having half as many switch transistors, but also from a
savings in well areas. Well areas might be significant, depending on whether the switches
are scattered enough that individual wells must be provided for each transmission gate. In
addition, internal nodes of n-channel pass transistor circuits may not require contacts, which
would save area and reduce junction capacitance. There is also a savings in routing area, as
the complementary address signal is not required to program the crossbar [6].
A major drawback of the single transistor approach is signal integrity. The pass
transistor causes a voltage drop equal to the threshold voltage (Vt) when passing highs, and
has a slower rise time towards the end of a low-to-high transition. These effects can be
partially compensated by lowering the switching threshold of the succeeding gates.
Unfortunately, the reduced high also means that the pull-up transistor in succeeding gates
will not be fully turned off, resulting in static power dissipation. This becomes a greater
problem when the doping is not adjusted to reduce Vt in the face of reduced supply voltages.
In such a case, the noise margins are severely degraded. Chapters 3 and 4 illustrate the
tradeoffs in using a single pass transistor to build the simple mesh and the Benes crossbars.
As stated in section 2.4, CAD tools present an important limitation for this project,
because the wire models available neglect interconnect properties. Although most
synthesizers and simulators model interconnect as a resistance-capacitance (RC) tree, the
resistance parameter is often assumed to be zero, which produces simulation results in which
wire delay has been neglected. At higher clock frequencies, the RC tree models are not even
appropriate, as interconnect starts to behave as a transmission line. Parasitic capacitances
were extracted from the layout. In an effort to recognize the increasing importance of
interconnect delay over gate delay, wire lengths were compared across the different crossbar
architectures. By making sure that wire widths are the same for all lines, one may assume
that wire resistance will be directly proportional to wire length. Switching rates were kept
below 20 MHz, to ensure that the RC tree models are still valid.
CHAPTER 3: THE SIMPLE MESH CROSSBAR
3.1 General circuit description
The simple mesh crossbar is essentially a grid of metal lines. The horizontal lines are
crossbar inputs. The vertical lines are crossbar outputs. At any point of intersection
between an input and an output, a switch selects whether the horizontal line and the vertical
line should be connected together. The control line which turns the switch on or off is
known as SELXX, where XX is a number that identifies each pass switch. These switches
can be either full pass gates, or just NMOS transistors. Figure 3.1 depicts this arrangement
for the 4x4 crossbar, using full pass gates as switches.
Figure 3.1: 4x4 simple mesh crossbar.
The vertical lines, which are the crossbar outputs, are laid out in the layer of metal closest to
the transistors, which is known as metal 1. The horizontal lines are crossbar inputs, and they
are routed in metal 2, the layer above metal 1. Metal 3 is reserved for power and ground
connections. Figure 3.2 shows the layout for the 4x4 simple mesh.
Figure 3.2: Layout for simple mesh crossbar.
3.2 Transistor Count and Area
The simple mesh crossbar requires an independent connection for each input/output (I/O)
pair. The I/O connecting switch can be either a full pass gate or an NMOS pass transistor.
In the case of a full pass gate design, each one of these connections requires two PMOS and
two NMOS transistors, for a total of four devices. One PMOS/NMOS pair makes up the
pass gate. The PMOS and NMOS transistors in the pass gate must receive complementary
signals at their transistor gates, for the switch to function properly. Thus, the other two
transistors provide an inverter for the SELXX signal. Figure 3.3 illustrates the transistor
circuit for one I/O connection.
Figure 3.3: Transistor circuit for one I/O connection, using a full
transmission gate as a switch.
Table 3.1 summarizes transistor count and layout area for the three crossbar sizes
constructed for this study. Notice that transistor count is four times higher for the full pass
gate implementation than for the single pass transistor approach.
Table 3.1: Transistor Count and Area
                 Single NMOS switch                Full pass gate switch
Crossbar size    Transistor count    Area (μm²)    Transistor count    Area (μm²)
4x4              16                  4,044         64                  11,729
8x8              64                  15,282        256                 46,915
16x16            256                 62,346        1024                187,661
The actual layout for these circuits was done “by hand”. That is, no automatic tool
generated the layout. The transistors were actually drawn and placed by the circuit designer,
based on standard cells for simple logic gates. As a result, the computed areas may vary for
other layout constructions. The area for the pass transistor mesh is
approximately one third of the area for the full pass gate version. Based on transistor count,
the pass transistor circuit could probably be compressed to less than 25% of the area for the
full pass gate crossbar. More important than the specific area of any one circuit, the key
issue concerning area is scalability. In other words, how does total circuit area scale with
increasing I/O connections? From Table 3.1 it follows that total area is proportional to N²,
where N is the number of crossbar inputs in an NxN crossbar. Figure 3.4 shows the scaling
trend for such a crossbar.
[Plot: layout area (square microns) versus crossbar size (4x4, 8x8, 16x16), for full pass gate switches and NMOS switches.]
Figure 3.4: Scaling trend for simple mesh crossbar.
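The quadratic scaling can be checked directly against Table 3.1. The Python sketch below divides each layout area by N² and also confirms that transistor count is N² for the single NMOS switch and 4N² for the full pass gate. The dictionary layout and variable names are illustrative; the numbers are those of Table 3.1.

```python
# N: (NMOS transistors, NMOS area in um^2, pass gate transistors, pass gate area in um^2)
TABLE_3_1 = {
    4: (16, 4044, 64, 11729),
    8: (64, 15282, 256, 46915),
    16: (256, 62346, 1024, 187661),
}


def area_per_switch(n, area_um2):
    """Layout area divided by the N*N switches of an NxN simple mesh."""
    return area_um2 / (n * n)


for n, (nmos_count, nmos_area, pg_count, pg_area) in TABLE_3_1.items():
    print(
        f"{n}x{n}: NMOS {area_per_switch(n, nmos_area):.1f} um^2/switch, "
        f"pass gate {area_per_switch(n, pg_area):.1f} um^2/switch, "
        f"area ratio {nmos_area / pg_area:.2f}"
    )
```

If area were exactly proportional to N², the area per switch would be constant across sizes; for these hand-drawn layouts it stays within a few percent for the NMOS version and is nearly constant for the full pass gate version.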
3.3 Delay estimations
In the case of the simple mesh crossbar, all inputs must go through exactly one switch,
which is responsible for most of the I/O delay. Additional delay is caused by parasitic
capacitances, which were modeled by the layout extraction procedure. I/O delay was
measured for each input/output pair under several configurations. For this approach it is
possible to turn on just one I/O connection at a time. However, both the Benes network and
the synthesized circuit lack this ability. As a result, to keep comparisons fair, this study only
considers configurations where every output is being toggled by some input.
Tables 3.2 and 3.3 summarize average delays for the three different sizes of the simple
mesh crossbar, across the two fabrication processes being studied. Notice the difference in
rising-transition delay and falling-transition delay for the crossbars in Table 3.2. Remember
that these crossbars use a single NMOS transistor as a programmable switch.
Table 3.2: Delay summary for simple mesh crossbar using NMOS transistors as
switches (delays in nanoseconds).
         AMI                          ULP
Size     Rising    Falling    Avg.    Rising    Falling    Avg.
4x4      1.15      1.21       1.18    1.54      2.19       1.87
8x8      2.74      3.93       3.34    5.09      6.3        5.70
16x16    4.37      5.54       4.96    8.03      11.2       9.62
Table 3.3: Delay summary for simple mesh crossbar using full pass gates as
switches (delays in nanoseconds).
         AMI                          ULP
Size     Rising    Falling    Avg.    Rising    Falling    Avg.
4x4      0.54      0.54       0.54    0.89      0.83       0.86
8x8      1.54      1.58       1.56    3.19      3.51       3.35
16x16    2.39      2.73       2.56    5.91      6.05       5.98
Figure 3.5 shows an NMOS transistor acting as a switch for a rising transition as well as
a falling transition. On a rising transition, the NMOS is initially in the triode region. As the
load capacitance charges towards Vdd, the NMOS gradually turns off. In contrast, before a
falling transition, the NMOS switch is cut off, and it turns on when the input falls below
(Vdd-Vt). While the transistor is off, the voltage swing at the input is not propagated to the
output. In addition to the time it takes the input to drop one threshold voltage, it takes the
transistor a certain amount of time to come out of the cut-off region. Beyond that, it takes a
certain amount of time for the output to reach the voltage at which delay is being measured,
in this case, Vdd/2. Because a rising transition does not need to wait for the switch to turn on
before the signal is propagated, delay measured on rising transitions is less than delay
measured on falling transitions. Note that full pass gates do not exhibit this behavior,
because at least one of the transistors in the pass gate is always turned on.
Figure 3.5: Rising and falling transitions using an NMOS switch.
Figures 3.6 and 3.7 illustrate the behavior of average delay as crossbars become larger.
The AMI process is generally faster than the version using ULP transistors. Additionally,
the circuit using pass gates as programmable switches is roughly twice as fast as the version
that only uses NMOS transistors. Although three points do not provide enough information
to extrapolate a best fit curve, it seems that average delay increases by roughly 180% to 290%
from the 4x4 crossbar to the 8x8 crossbar, but only by about 50% to 80% from the
8x8 crossbar to the 16x16 crossbar.
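The growth in delay between crossbar sizes can be recomputed from the average delays in Tables 3.2 and 3.3. The following Python sketch does so for each switch type and transistor model; the data structure and function name are illustrative only.

```python
# Average delays in ns for the 4x4, 8x8, and 16x16 crossbars (Tables 3.2 and 3.3).
AVG_DELAY_NS = {
    ("NMOS", "AMI"): [1.18, 3.34, 4.96],
    ("NMOS", "ULP"): [1.87, 5.70, 9.62],
    ("pass gate", "AMI"): [0.54, 1.56, 2.56],
    ("pass gate", "ULP"): [0.86, 3.35, 5.98],
}


def percent_increase(old, new):
    """Relative growth from old to new, expressed in percent."""
    return 100.0 * (new - old) / old


for (switch, process), (d4, d8, d16) in AVG_DELAY_NS.items():
    print(
        f"{switch:9s} {process}: 4x4 to 8x8 {percent_increase(d4, d8):6.1f}%, "
        f"8x8 to 16x16 {percent_increase(d8, d16):6.1f}%"
    )
```

In every case the first doubling of crossbar size costs considerably more, in relative terms, than the second.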
[Plot: average delay (nanoseconds) versus crossbar size (4x4, 8x8, 16x16) for crossbars based on NMOS switches, AMI and ULP.]
Figure 3.6: Average delay for the pass transistor version
of the simple mesh crossbar.
[Plot: average delay (nanoseconds) versus crossbar size (4x4, 8x8, 16x16) for crossbars based on full pass gates, AMI and ULP.]
Figure 3.7: Average delay for the full pass gate version
of the simple mesh crossbar.
To understand why delay would grow less dramatically for larger crossbars, one must
understand how that delay is being modeled. Each I/O connection can be modeled as a
resistance-capacitance (RC) circuit. Recall that wire resistance is assumed to be zero for
these analyses. Thus, the resistive component of the circuit is provided by the dynamic
resistance of the pass transistors. Note that this resistive component is independent of
crossbar size, as every I/O path contains exactly one pass transistor. Based on this premise,
variations in delay would depend entirely on capacitances.
Diffusion capacitances for the programmable switches contribute the same amount of
capacitance to each I/O path, regardless of crossbar size. Hence, the only capacitance
contribution that is affected by crossbar size is parasitic capacitance. For this project,
parasitic capacitance was extracted from the layout using Virtuoso®, from Cadence ®.
Capacitance extraction generates lumped capacitors at different nodes in the circuit.
Average parasitic capacitance was computed by adding all extracted capacitances and
dividing by the total number of extracted capacitors. Table 3.4 presents average parasitic
capacitances obtained from the layout for the three crossbar sizes. Notice that average
parasitics seem to double as the number of crossbar inputs doubles. The contribution of
capacitance between metal lines depends on the distance between these metal lines. Wires
that are closer together have more influence on delay than those spaced far apart. The metal
lines added to the 4x4 crossbar to make it an 8x8 crossbar have a significant impact on
delay, because the distance between the preexisting lines and the new lines is relatively
small. In other words, the 8x8 crossbar is small enough that most metal lines represent a
significant contribution to parasitic capacitance at any point in the circuit. The 16x16
version is larger, so that only the lines within a certain radius affect parasitics on any given
line.
Table 3.4: Average parasitic capacitances for simple mesh crossbar
Crossbar size    Average extracted capacitance
4x4              10.7 fF
8x8              21.2 fF
16x16            41.7 fF
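To make the RC argument concrete, the sketch below uses the standard first-order result that a single resistor charging a single capacitor reaches 50% of its final value after R·C·ln 2. With the switch resistance held constant, the modeled delay scales exactly with capacitance. The 10 kΩ switch resistance is a hypothetical value chosen only for illustration, and the lumped single-capacitor model is a simplification of the extracted netlist, so the simulated delays in Tables 3.2 and 3.3 do not follow this ratio exactly.

```python
import math

# Average extracted parasitic capacitance in femtofarads (Table 3.4).
AVG_PARASITIC_FF = {4: 10.7, 8: 21.2, 16: 41.7}

# Hypothetical on-resistance of one switch; not a value from this study.
R_SWITCH_OHM = 10e3


def rc_delay_50(r_ohm, c_farad):
    """Time for a first-order RC step response to reach 50% of its swing."""
    return math.log(2) * r_ohm * c_farad


for small, large in ((4, 8), (8, 16)):
    cap_ratio = AVG_PARASITIC_FF[large] / AVG_PARASITIC_FF[small]
    t_small = rc_delay_50(R_SWITCH_OHM, AVG_PARASITIC_FF[small] * 1e-15)
    t_large = rc_delay_50(R_SWITCH_OHM, AVG_PARASITIC_FF[large] * 1e-15)
    print(
        f"{small}x{small} to {large}x{large}: capacitance ratio {cap_ratio:.2f}, "
        f"modeled delay ratio {t_large / t_small:.2f}"
    )
```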
For the fabrication processes being compared, wire resistance is negligible compared to
the dynamic resistance of transistors. In fact, fabrication process models developed by the
foundries themselves neglect the effects of wire resistance. In an effort to recognize the
increasing role of interconnect delay in overall circuit performance, Table 3.5 lists the
longest paths and their respective wire lengths. Note the effect of scaling in wire length.
Doubling crossbar size essentially doubles worst case wire length across the crossbar.
Table 3.5: Worst case wire lengths
Crossbar size    Longest path    Length of longest path
4x4              In0 to Out3     187.65 μm
8x8              In0 to Out7     404.6 μm
16x16            In0 to Out15    777.6 μm
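Since all wires are assumed to have the same width, wire resistance scales with length, so the length ratios of Table 3.5 are also the expected resistance ratios. The Python sketch below uses the usual sheet-resistance estimate R = Rs · L / W. The width and sheet resistance are hypothetical placeholders, not values from the AMI or ULP technology files; only the ratios are meaningful.

```python
# Worst case wire lengths in micrometers (Table 3.5).
WORST_CASE_LENGTH_UM = {4: 187.65, 8: 404.6, 16: 777.6}

# Hypothetical wire parameters, used only to illustrate the proportionality.
WIRE_WIDTH_UM = 0.9
SHEET_RES_OHM_PER_SQ = 0.1


def wire_resistance(length_um, width_um, sheet_ohm_per_sq):
    """Resistance of a uniform wire: sheet resistance times number of squares."""
    return sheet_ohm_per_sq * length_um / width_um


for small, large in ((4, 8), (8, 16)):
    r_small = wire_resistance(
        WORST_CASE_LENGTH_UM[small], WIRE_WIDTH_UM, SHEET_RES_OHM_PER_SQ
    )
    r_large = wire_resistance(
        WORST_CASE_LENGTH_UM[large], WIRE_WIDTH_UM, SHEET_RES_OHM_PER_SQ
    )
    print(f"{small}x{small} to {large}x{large}: resistance ratio {r_large / r_small:.2f}")
```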
3.4 Power Consumption
Theoretically, CMOS circuits are often said to consume no static power, because, unless
there is an input transition, there is no direct path for current to flow between the power
rails. In real circuits, however, a small leakage current flows through the transistors even
when they are cut off.
Let us define Itot as the total current drawn by a crossbar on a given simulation run.
Figure 3.8 shows typical Itot curves for the 4x4 simple mesh crossbar using only NMOS
transistors. Curve A corresponds to a simulation run using AMI transistor models. Curve B
depicts the results for the same simulation run using ULP transistor models. The crossbar is
operating without multicasting. For the first 30 ns, all data inputs to the crossbars are low.
At 30 ns, all inputs to the crossbar switch to high. All inputs stay high for at least 100 ns
more. Beyond the first 130 ns, each input toggles at a different rate.
Figure 3.8: Itot for mesh with NMOS switches.
Curve A for AMI, and Curve B for ULP.
Figure 3.9 presents the load seen by any data input line in the circuit. Transistors T1
through T4 provide connections from the input to any one of the outputs. Observe that in
one-to-one operation, only one of the switches will be turned on at any given time.
Consequently, the input source must provide enough current to drive the load on one output,
plus the leakage current drawn by the three switches that are turned off.
Figure 3.9: Load seen by one crossbar input.
Going back to our analysis of figure 3.8, since all data voltages are zero for the first 30 ns
of the simulation, all transistors are cut off. When all switches are cut off, the current drawn
by the circuit approximates the true leakage current of the system. Note that, under these
conditions, the ULP transistors produce more leakage than their AMI counterparts, just as
expected. At 30 ns there is a current spike, consistent with the low to high transition of the
inputs. At this point, capacitive loads on the outputs of active NMOS switches must be
charged, drawing current from the input sources. These current spikes are responsible for
dynamic power consumption of the circuit. Once the load capacitances have been charged,
current consumption falls back to leakage current levels. At 130 ns, another current spike
indicates that at least one of the inputs has transitioned from high to low.
So far, power consumption has behaved exactly as predicted in the literature. Hass et al.
have pointed out that leakage may be significant for the ULP process, because of the low
threshold voltage associated with ULP [14]. Furthermore, they mention that highly parallel
structures, such as crossbars or memories are more likely to be affected by high leakage
currents. The curves in figure 3.8 show that for the first 130 ns of the simulation, ULP does
indeed consume more static power than AMI. However, after 130 ns, static power
consumption for AMI is far larger than that of ULP.
Recall, however, that using a single pass transistor leads to narrower noise margins and
higher static power consumption. Figure 3.10 shows an unbuffered crossbar output for the
simulation above, using the AMI transistors. The corresponding input is shown as a
reference. Figure 3.11 is the result of repeating the simulation using ULP transistors. The
static power dissipation evident in the Itot curves is a consequence of the degraded voltage
levels seen at the crossbar outputs. The fact that the output voltage stays between 0.3 volts
and 3 volts is evidence that transistors are not fully turned off, thereby increasing static
power consumption. Figures 3.8, 3.10, and 3.11 illustrate the inherent shortcomings of
using a pass transistor as a programmable switch.
Figure 3.10: Output voltage degradation for the AMI crossbar.
Figure 3.11: Output voltage degradation for the ULP crossbar.
Figure 3.12 is equivalent to figure 3.8, except that the circuits simulated use full pass
gates as programmable switches. Note that, in this case, the AMI circuit shows virtually no
static power consumption. However, the AMI circuit also shows larger current peaks on
signal transitions. On the other hand, the ULP process shows significant leakage, with much
smaller current spikes on signal transitions. As expected, ULP consumes far less dynamic
power, at the expense of higher leakage.
Figure 3.12: Total current for mesh with full pass gates as switches.
Curve A for AMI, and Curve B for ULP.
One of the main goals of this study is to determine which crossbar architecture works
best in spite of leaky transistors. It is interesting to note that the choice of programmable
switch crucially alters this assessment. For an implementation using pass transistors, the
static power consumption due to degraded voltage levels overshadows the effects of leakage.
In that case, static power consumption is more closely related to the process threshold
voltages. Higher threshold voltages translate into more voltage degradation, so that the
voltage levels are often insufficient to fully cut off transistors. In contrast, for a full pass
gate design, static power consumption is directly proportional to leakage current. Tables 3.6
and 3.7 summarize leakage current, average static power consumption, and Ipeak, for all the
circuits being compared.
Table 3.6: Power consumption summary for simple mesh crossbar using NMOS
transistors as switches. Currents in μA, power in milliwatts.
         AMI                                ULP
Size     Ileak    Ipeak    Static power     Ileak    Ipeak    Static power
4x4      0        835      0.58             17.9     65       0.009
8x8      0        2,254    1.47             49.5     221.9    0.025
16x16    0.3      5,313    1.88             93       528.4    0.047
Table 3.7: Power consumption summary for simple mesh crossbar using full pass
gates as switches. Currents in μA, power in milliwatts.
         AMI                                ULP
Size     Ileak    Ipeak    Static power     Ileak    Ipeak    Static power
4x4      0        875      0                8.2      105.3    0.004
8x8      0        2,293    0                26       221.8    0.013
16x16    0        5,290    0                92.9     508.3    0.046
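For the ULP columns, the static power entries follow from multiplying Ileak by the 0.5 volt supply, as described in Chapter 2. The Python sketch below repeats that arithmetic and compares it with the tabulated values. It is limited to the ULP columns because the AMI static power in Table 3.6 is attributed to degraded voltage levels rather than to Ileak. The names used are illustrative.

```python
ULP_SUPPLY_V = 0.5

# (switch type, N): (Ileak in uA, tabulated static power in mW), Tables 3.6 and 3.7.
ULP_ROWS = {
    ("NMOS", 4): (17.9, 0.009),
    ("NMOS", 8): (49.5, 0.025),
    ("NMOS", 16): (93.0, 0.047),
    ("pass gate", 4): (8.2, 0.004),
    ("pass gate", 8): (26.0, 0.013),
    ("pass gate", 16): (92.9, 0.046),
}


def static_power_mw(ileak_ua, vdd_v):
    """Static power in milliwatts from leakage current in microamperes."""
    return ileak_ua * vdd_v / 1000.0


for (switch, n), (ileak_ua, tabulated_mw) in ULP_ROWS.items():
    computed = static_power_mw(ileak_ua, ULP_SUPPLY_V)
    print(f"{switch:9s} {n}x{n}: computed {computed:.5f} mW, tabulated {tabulated_mw} mW")
```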
When NMOS transistors are used as switches, the AMI crossbars consume more static
power than their ULP counterparts. However, the full pass gate implementation yields no
static power consumption at all for the AMI crossbar. At any rate, static power consumption
due to leakage is on the order of microwatts for the crossbar sizes in this study. Even if each crossbar input
were a 16-bit bus, the worst case leakage power for a 16x16 crossbar would be on the
order of one milliwatt. Table 3.6 and figure 3.8 show that static power dissipation due to
degraded output voltages is far more significant than static power dissipation due to leakage
current, even for a high leakage process such as ULP.
Comparing the values for Ipeak across fabrication processes, signal transitions on the AMI
crossbar draw around ten times more current than the corresponding ULP transitions. In
other words, AMI consumes much more dynamic power than ULP, for a given operating
frequency and capacitive load.
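The ratio of peak currents can be read off Tables 3.6 and 3.7. The short Python sketch below computes the AMI to ULP ratio of Ipeak for each crossbar size and switch type; the data layout is illustrative.

```python
# Ipeak in microamperes for the 4x4, 8x8, and 16x16 crossbars (Tables 3.6 and 3.7).
IPEAK_UA = {
    "NMOS": {"AMI": [835.0, 2254.0, 5313.0], "ULP": [65.0, 221.9, 528.4]},
    "pass gate": {"AMI": [875.0, 2293.0, 5290.0], "ULP": [105.3, 221.8, 508.3]},
}


def peak_ratios(switch):
    """AMI-to-ULP ratio of Ipeak for each crossbar size."""
    data = IPEAK_UA[switch]
    return [ami / ulp for ami, ulp in zip(data["AMI"], data["ULP"])]


for switch in IPEAK_UA:
    print(switch, [round(r, 1) for r in peak_ratios(switch)])
```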
3.5 Other considerations
In general, VLSI circuits are compared according to speed, layout area, and power
consumption, as the previous sections have described. According to their specific function,
crossbars should also be compared in terms of their architectural efficiency. As a figure of
merit, architectural efficiency seeks to summarize the compromises between programming
flexibility, number of I/O pins, signal delay, and design complexity. It must be pointed out
that architectural efficiency does not take into account any electrical parameters of the actual
crossbar circuit. The computed efficiency for a given architecture is independent of the
fabrication process, or the actual layout information of the circuit. In a sense, architectural
efficiency serves as a first order approximation for comparing crossbar architectures,
regardless of the actual implementation details. Table 3.8 presents architectural efficiency
for the simple mesh crossbar.
Table 3.8: Architectural efficiency for the simple mesh
Size     I/O connections    Transistor count    Delay elements per connection    Utilization    Arch. Eff.
4x4      16                 64                  1                                0.25           0.0625
8x8      64                 256                 1                                0.125          0.03125
16x16    256                1024                1                                0.0625         0.01563
Table 3.8 shows that a 4x4 simple mesh crossbar is a more efficient circuit than the 8x8,
which in turn is more efficient than the 16x16. This means that although a larger crossbar is
more flexible, the price being paid in terms of transistor count, utilization, and delay grows
faster than the number of connection alternatives achieved.
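The entries of Table 3.8 can be reproduced numerically. The relation used in the Python sketch below, (I/O connections / transistor count) × utilization / delay elements, is inferred here from the tabulated numbers and is an assumption; the definition of architectural efficiency given elsewhere in this document takes precedence if it differs. Likewise, utilization is assumed to equal N active switches out of N² for an NxN mesh, which matches the table.

```python
# N: (I/O connections, transistor count, delay elements, utilization, tabulated efficiency)
TABLE_3_8 = {
    4: (16, 64, 1, 0.25, 0.0625),
    8: (64, 256, 1, 0.125, 0.03125),
    16: (256, 1024, 1, 0.0625, 0.01563),
}


def mesh_utilization(n):
    """Assumed utilization: N switches in use out of N*N."""
    return n / (n * n)


def inferred_efficiency(connections, transistors, delay_elements, utilization):
    """Relation inferred from Table 3.8; not necessarily the formal definition."""
    return (connections / transistors) * utilization / delay_elements


for n, (conn, trans, delay_elems, util, tabulated) in TABLE_3_8.items():
    value = inferred_efficiency(conn, trans, delay_elems, util)
    print(f"{n}x{n}: inferred {value:.5f}, tabulated {tabulated}")
```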
Among the qualitative advantages of the simple mesh architecture, the most obvious is its
low design complexity. This circuit yields an extremely regular layout that can be tiled to
produce crossbars of different sizes. Yet, the greatest drawback for this circuit is its poor
scalability. The number of switches increases quadratically with crossbar size. Besides the
obvious impact such growth has on transistor count and area, for the simple mesh it also
means a quadratic increase in the number of configuration signals required. Depending on
the implementation of the programming interface, a sharp increase in configuration signals
can lead to either longer configuration times, or greater pin counts.
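The growth in configuration signals can be illustrated with a small Python sketch. For the simple mesh, one SELXX line per switch gives N² signals. For comparison, the sketch also counts the select bits of a multiplexer-based crossbar, assuming one select vector of log2(N) bits per output, as in the 4x4 VHDL description of Figure 4.1, and ignoring the direction signal. Extending that count to larger N is an assumption made here for illustration.

```python
def mesh_config_signals(n):
    """One SELXX control line per switch in an NxN simple mesh."""
    return n * n


def mux_select_bits(n):
    """Assumed count: one select vector of ceil(log2(N)) bits per output."""
    bits_per_output = (n - 1).bit_length()
    return n * bits_per_output


for n in (4, 8, 16):
    print(f"{n}x{n}: mesh {mesh_config_signals(n)} signals, "
          f"multiplexer-based {mux_select_bits(n)} select bits")
```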
Concerns about circuit area become even more pressing for applications where reliability
is an issue. Fault tolerant systems rely on redundant circuitry, which can easily double or
triple circuit area and transistor count. Some fault detection and correction schemes even
call for five copies of a given circuit. Other reliability issues, such as radiation tolerance,
impose restrictions on the separation between circuit components. Radiation tolerance also
uses guardbands, and other layout techniques, to improve circuit reliability in a harsh
environment.
The simple mesh approach is best suited for small crossbars, like the ones presented in
this study. The ideal application for a simple mesh crossbar is one where pin count is not a
limiting factor. Alternatively, when pins are limited, a simple mesh crossbar would still be a
good choice, as long as slower configuration times can be tolerated.
CHAPTER 4: MULTIPLEXER-BASED CROSSBAR
4.1 General circuit description
It is possible to synthesize a crossbar from a hardware description language (HDL)
representation. First, the synthesizer generates a circuit that produces the behavior specified
by the HDL description. The synthesized circuit is then mapped onto a standard cell library,
which is associated with a specific fabrication process. A standard cell library can be
thought of as the set of building blocks that may be used to implement the circuit in a
particular technology. Different standard cell libraries may contain different building
blocks. Thus, the final synthesized circuit depends on the standard cell library used.
The final implementation also depends on the synthesizer itself, and its ability to translate
HDL instructions into logic circuits. In fact, it is possible to write HDL code that can be
simulated but not synthesized. Figure 4.1 shows VHDL code for the 4x4 bidirectional
crossbar. Figure 4.2 shows the synthesized circuit for this code. Note that the “with-select”
statements have synthesized to multiplexers, while the “if” statement has synthesized to
tristate buffers. In general, “case” and “select” clauses synthesize to multiplexers, but “if”
statements are mapped to individual tristate buffers. Notice that any given algorithm may be
described in several different ways in HDL. At the behavioral level, the choice between an
“if” and a “case” is based on convenience. However, when coding for synthesis, the
designer should write the HDL description keeping the target circuit in mind. Although
multiplexers and tristate buffers may accomplish the same logical behavior, their physical
characteristics are often very different.
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
entity xbar is
port( dir: in std_logic;
d0,d1,d2,d3: inout std_logic;
sa0,sa1,sa2,sa3: in std_logic_vector(1 downto 0);
xb0,xb1,xb2,xb3: inout std_logic);
end xbar;
architecture BEHAVE of xbar is
begin
if dir='0' then
with sa0 select
xb0 <= d0 when "00", d1 when "01", d2 when "10", d3 when "11", d0 when others;
with sa1 select
xb1 <= d0 when "00", d1 when "01", d2 when "10", d3 when "11", d0 when others;
with sa2 select
xb2 <= d0 when "00", d1 when "01", d2 when "10", d3 when "11", d0 when others;
with sa3 select
xb3 <= d0 when "00", d1 when "01", d2 when "10", d3 when "11", d0 when others;
elsif dir='1' then
with sa0 select
d0 <= xb0 when "00", xb1 when "01", xb2 when "10", xb3 when "11", xb0 when others;
with sa1 select
d1 <= xb0 when "00", xb1 when "01", xb2 when "10", xb3 when "11", xb0 when others;
with sa2 select
d2 <= xb0 when "00", xb1 when "01", xb2 when "10", xb3 when "11", xb0 when others;
with sa3 select
d3 <= xb0 when "00", xb1 when "01", xb2 when "10", xb3 when "11", xb0 when others;
end if;
end BEHAVE;
Figure 4.1: VHDL code for 4x4 bidirectional crossbar.
Figure 4.2: Synthesized 4x4 bidirectional crossbar
The standard cell library for the AMI process was provided by the Electrical Engineering
Department at North Carolina State University (NCSU), through their NCSU Cadence
package. The NCSU standard cell library does not include a four-to-one multiplexer. Thus,
each of the four-to-one multiplexers shown in figure 4.2 was constructed from two-to-one
multiplexers, as shown in figure 4.3.
Figure 4.3: Four to one MUX implementation.
In fact, the only multiplexer included in the AMI standard cell library available with the
NCSU package is a two-to-one multiplexer. Any larger multiplexer can be built using this
basic multiplexer. Thus, all crossbars synthesized from the AMI standard cell library consist
of two-to-one multiplexers, tristate buffers, and some combinational logic.
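The construction of larger multiplexers from the two-to-one cell can be sketched in a few lines of Python. This is only a behavioral illustration, assuming the balanced tree arrangement that figure 4.3 suggests; the function names mux2 and mux_tree are hypothetical and do not come from the NCSU library.

```python
def mux2(sel, a, b):
    """Two-to-one multiplexer: returns a when sel is 0, b when sel is 1."""
    return b if sel else a


def mux_tree(select_bits, inputs):
    """N-to-one multiplexer built only from mux2 cells (assumed balanced tree).

    select_bits[0] is the least significant select bit, and len(inputs)
    must equal 2 ** len(select_bits).
    Returns (selected value, number of mux2 cells used).
    """
    if len(inputs) != 2 ** len(select_bits):
        raise ValueError("len(inputs) must equal 2 ** len(select_bits)")
    level = list(inputs)
    cells = 0
    for sel in select_bits:
        # Each tree level pairs adjacent signals and halves the signal count.
        level = [mux2(sel, level[i], level[i + 1]) for i in range(0, len(level), 2)]
        cells += len(level)
    return level[0], cells


# A four-to-one multiplexer needs three two-to-one cells in two levels.
value, cells = mux_tree([1, 0], ["d0", "d1", "d2", "d3"])  # select word "01"
```

In this arrangement an N-to-one multiplexer uses N-1 two-to-one cells, and a signal passes through log2(N) of them.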
Layout for this circuit was obtained using Silicon Ensemble ® from Cadence ®. Silicon
Ensemble is an automated place and route tool. The routed circuit is then exported to the
Virtuoso ® Layout Editor using GDSII format. After performing a Design Rule Check
(DRC) on the layout, parasitic capacitances are extracted using the same procedure followed
for custom layout crossbars. SPICE netlists are then generated and simulated as explained
in chapter 2. Figure 4.4 presents layout for the 4x4 synthesized crossbar. Other than power
and ground rails, metal lines do not run straight from inputs to outputs, as they do on custom
layout for crossbars. Instead, inputs and outputs follow convoluted patterns, often spanning
two or even three metal layers. The vias that these datapaths require have a significant
impact on delay, which the CAD tools used for this project do not model.
Figure 4.4: Layout for the synthesized 4x4 crossbar.
4.2 Transistor Count and Area
The basic building block for the synthesized crossbar is the two-to-one multiplexer, as
implemented in NCSU’s AMI standard cell library. Figure 4.5 depicts the corresponding
transistor circuit for the two-to-one multiplexer. Other standard cells used in the crossbar
netlist include two-input NAND gates, two-input NOR gates, four-input AOI gates, and
inverters.
Figure 4.5: Transistor circuit for a two-to-one multiplexer.
Table 4.1 summarizes transistor count and layout area for the three crossbar sizes
constructed for this study. Transistor count was obtained from Design Compiler ®, the
synthesis tool that generated the circuits based on the VHDL description.
Table 4.1: Transistor count and layout area
Crossbar size    Transistor count    Area (μm²)
4x4              248                 13,870
8x8              1168                59,302
16x16            5032                257,568
It must be pointed out that, because the layout was generated using an automatic tool,
logic density for the synthesized crossbar is greater than for the other two crossbar
implementations. Place and route tools allow the user to define a target area for their layout.
Additionally, the circuit designer is asked for a utilization factor, and a mapping effort. The
layout data summarized in table 4.1 was obtained using 75% utilization with mapping effort
set to low. This means that circuit area can probably be reduced by at least 25% from the
values presented above. As with the simple mesh, because area is somewhat dependent on
judgement calls made by the circuit designer, actual area may vary for other place and route
runs of the same netlist.
The key advantage of automatic place and route is that design time is reduced drastically.
Not only are tedious and repetitive tasks done automatically, but the likelihood of an error
on the layout becomes minimal. In general, place and route tools will also achieve denser
layouts than custom designs. Observe, however, that layout area for the synthesized
crossbar is actually larger than that for a simple mesh crossbar. The synthesizer is
constrained by the specific instructions in the HDL description, and the standard cells
available in the library. A human designer or a specialized crossbar generator can spot the
underlying regularity in crossbar behavior, and come up with alternative circuits that
optimize area, speed, or power consumption, depending on the intended crossbar
application.
Figure 4.6 illustrates how area scales, according to the data in table 4.1. Transistor count
increases by a factor of roughly four (about 4.7) between the 4x4 and the 8x8 crossbar. The
scaling between an 8x8 crossbar and a 16x16 crossbar also corresponds to a factor of roughly
four (about 4.3). Target area for each layout was chosen according to this trend.
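The scaling factors can be checked directly against table 4.1. The short Python sketch below only restates the table values and divides them; the names table_4_1 and growth_ratios are hypothetical.

```python
# Values copied from table 4.1: size -> (transistor count, area in square microns)
table_4_1 = {4: (248, 13870), 8: (1168, 59302), 16: (5032, 257568)}


def growth_ratios(table):
    """Ratio of transistor count and of area each time crossbar size doubles."""
    sizes = sorted(table)
    ratios = {}
    for small, large in zip(sizes, sizes[1:]):
        transistor_ratio = table[large][0] / table[small][0]
        area_ratio = table[large][1] / table[small][1]
        ratios[(small, large)] = (transistor_ratio, area_ratio)
    return ratios


for (small, large), (t_ratio, a_ratio) in growth_ratios(table_4_1).items():
    print(f"{small}x{small} -> {large}x{large}: "
          f"transistors x{t_ratio:.2f}, area x{a_ratio:.2f}")
```

Both ratios stay between four and five per doubling, which is the trend used to choose the target areas.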
[Plot: layout area in square microns (0 to 300,000) versus crossbar size (4x4, 8x8, 16x16).]
Figure 4.6: Scaling trend for synthesized crossbar.
4.3 Delay estimations
In a multiplexer-based crossbar, the number of delay elements that a signal must go
through varies with crossbar size. Because datapaths are constructed using two to one
multiplexers, doubling the number of inputs results in an additional multiplexer on each line.
Additional delay is caused by parasitic capacitances, which were modeled by the layout
extraction procedure. I/O delay was measured for each input/output pair under several
configurations.
Table 4.2 summarizes average delays for the three different sizes of the synthesized
crossbar, across the two fabrication processes being studied. Notice there is little difference
in rising-transition delay and falling-transition delay for the simulations using AMI
transistor models. Transistors in the standard cells have been sized for equal rise and fall
times. Because carrier mobility is different in the ULP models, the difference in delay for
rising and falling transitions is more noticeable.
Table 4.2: Delay summary for multiplexer-based crossbar (delays in ns)
         AMI                           ULP
Size     Rising    Falling   Avg.      Rising    Falling   Avg.
4x4      0.786     0.896     0.841     1.986     2.694     2.34
8x8      2.61      2.517     2.564     6.602     7.87      7.236
16x16    3.912     3.022     3.467     10.17     8.456     9.313
Figure 4.7 illustrates the behavior of average delay as crossbars become larger. As
expected, the AMI version is faster than the version using ULP transistors. Again, three
points do not provide enough information to extrapolate a best fit curve. Still, table 4.2
shows that average delay increases by more than 200% for the 8x8 crossbar with respect to
the 4x4 crossbar, but by only about 30% to 35% between the 8x8 crossbar and the 16x16 crossbar.
Interestingly, this behavior of delay with crossbar scaling is very similar to that for the
simple mesh crossbar. From these results, it seems that the addition of one multiplexer on
every data line does not significantly impact average delay. As explained in chapter 3, the
influence of parasitic capacitances seems to be the driving factor in delay across different
crossbar sizes.
[Plot: average delay in nanoseconds (0 to 10) versus crossbar size (4x4, 8x8, 16x16); series: AMI, ULP.]
Figure 4.7: Average delay for the synthesized crossbar.
For the fabrication processes being compared, wire resistance is negligible compared to
the dynamic resistance of transistors. In fact, fabrication process models developed by the
foundries themselves neglect the effects of wire resistance. In an effort to recognize the
increasing role of interconnect delay in overall circuit performance, table 4.3 lists the longest
paths and their respective wire lengths. Note the effect of scaling on wire length. Doubling
crossbar size increases worst case wire length by roughly 100 μm.
Table 4.3: Worst case wire lengths
Crossbar size    Longest path    Length of longest path
4x4              d0 to xb2       135 μm
8x8              d0 to xb4       238 μm
16x16            xb0 to d11      340 μm
In the synthesized crossbar, wire length is determined by the place and route tool.
However, it is possible to issue design constraints telling the tool to keep a certain I/O path
under a maximum length. No such design constraints were set for this experiment. Because
all I/O signals must traverse the same kinds of logic gates, and the designer has not singled
out any specific wires through design constraints, it is reasonable to expect I/O wire lengths
to behave according to a Gaussian distribution. This is indeed the case. In the simple mesh,
all I/O connections had different lengths, depending on the position of the crossbar switch.
Meanwhile, most I/O connections in the synthesized crossbar are roughly the same length,
with several wires slightly above or below this average.
4.4 Power Consumption:
Table 4.4 presents power consumption data for the multiplexer-based crossbar. As
expected, the AMI circuit shows virtually no static power consumption. On the other hand,
the ULP process shows significant leakage, with static power consumption on the order of
microwatts. Yet, comparing the values for Ipeak across fabrication processes, signal
transitions on the AMI crossbar draw around ten times more current than the corresponding
ULP transitions. In other words, AMI consumes much more dynamic power than ULP, for
a given operating frequency and capacitive load.
Table 4.4: Power consumption summary for synthesized crossbar. Currents in
μA, power in milliwatts.
         AMI                              ULP
Size     Ileak   Ipeak    Static power    Ileak    Ipeak    Static power
4x4      0       3,683    0               26.22    372.7    0.01311
8x8      0       14,040   0               70.7     1,519    0.03535
16x16    0       31,290   0               364.93   4,012    0.1824
Recalling that capacitive load is 0.1 pF per input for our simulations, it is possible to
compute dynamic power consumption as a function of frequency for both fabrication
processes. Furthermore, adding static power consumption to this rate yields total power
consumption as a function of frequency. Figures 4.8, 4.9, and 4.10 show total power
consumption as a function of frequency for the three crossbar sizes. On all of these graphs,
the point of intersection between the two lines corresponds to the minimum frequency at
which ULP consumes less power than AMI.
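The intersection point can also be found analytically. The Python sketch below assumes total power is the dynamic term of equation 2.1, P = C*f*Vdd^2*u, plus a constant static term, so that each process gives a straight line in frequency. The function names are hypothetical, and the supply voltages and switched capacitance passed in are left to the caller because they depend on the simulation setup.

```python
def total_power(freq_hz, c_switched, vdd, activity, static_power):
    """Total power in watts: dynamic term C*f*Vdd^2*u plus static power."""
    return c_switched * freq_hz * vdd ** 2 * activity + static_power


def crossover_frequency(c_switched, activity, vdd_a, static_a, vdd_b, static_b):
    """Frequency in Hz at which processes A and B consume equal total power.

    Assumes both processes switch the same capacitance with the same
    activity factor. Returns None when the two lines are parallel.
    """
    slope_difference = c_switched * activity * (vdd_a ** 2 - vdd_b ** 2)
    if slope_difference == 0:
        return None
    return (static_b - static_a) / slope_difference


# Hypothetical illustration only: 1 pF switched, 2 V versus 1 V supplies,
# 3 microwatts of static power for the low-voltage process.
f_cross = crossover_frequency(1e-12, 1.0, 2.0, 0.0, 1.0, 3e-6)
```

Above the returned frequency, the process with the lower supply voltage consumes less total power; below it, its static power dominates.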
[Plot: total power consumption in microwatts (0 to 16) versus frequency (0.1 to 3.1 MHz), 4x4 crossbar; series: AMI, ULP.]
Figure 4.8: Power consumption vs. frequency for the 4x4 synthesized crossbar
[Plot: total power consumption in microwatts (0 to 80) versus frequency (0.1 to 1.9 MHz), 8x8 crossbar; series: AMI, ULP.]
Figure 4.9: Power consumption vs. frequency for the 8x8 synthesized crossbar
[Plot: total power consumption in microwatts (0 to 1400) versus frequency (0.1 to 1.9 MHz), 16x16 crossbar; series: AMI, ULP.]
Figure 4.10: Power consumption vs. frequency for the 16x16 synthesized crossbar
From figure 4.8, the minimum frequency at which the ULP version of the 4x4
synthesized crossbar consumes less power than its AMI counterpart is approximately 3
MHz. This means that for ULP to actually save power over AMI, all inputs to the crossbar
must toggle at an average rate of at least 3 MHz. In other words, if every input toggles
roughly once every ten clock cycles, the minimum clock frequency for which ULP saves power
is 30 MHz.
For the 8x8 multiplexer-based crossbar, the minimum frequency that yields power
savings for ULP is approximately 0.95 MHz. For the 16x16 crossbar this minimum
frequency drops below 0.3 MHz. As a general tendency, circuits operating at lower
frequencies are more affected by leakage than those that toggle at faster rates. It is also true
that leakage current in ULP is less harmful for larger crossbars than it is for smaller ones.
4.5 Other considerations:
Table 4.5 presents architectural efficiency for the multiplexer-based crossbar.
Table 4.5: Architectural efficiency for the synthesized crossbar
Size     I/O connections   Transistor count   Delay elements per connection   Utilization   Arch. Eff.
4x4      16                248                3                               0.333         0.00926
8x8      64                1168               4                               0.21          0.00383
16x16    256               5032               5                               0.133         0.00089
Although architectural efficiency will be used in chapter 6 as a first order comparison of
crossbar architectures, this parameter does not account for the main advantage of the
multiplexer-based crossbar. The key reason for choosing this approach is the fact that it is
easily synthesizable. Even if the architecture itself is less efficient, the significant reduction
in design time can be used to achieve more thorough optimizations on critical design
criteria. In some ways, this approach exhibits even lower design complexity than the simple
mesh architecture. While the final circuit is certainly more complex than a simple mesh,
most of that complexity is being handled by CAD tools, rather than human designers.
The major drawback of the multiplexer-based architecture is precisely its circuit
complexity. This approach has the highest transistor count for all crossbar sizes, with four
times as many transistors as the simple mesh approach. To keep a reasonable area, logic
density must be much higher for the synthesized crossbar than for other approaches. When
transistors are packed more tightly, a particle hitting an integrated circuit will affect more
transistors. Thus the multiplexer-based crossbar is inherently less radiation tolerant than
other crossbars with similar areas. Moreover, once the circuit is fabricated, it becomes
harder to locate a particular source of failure, because the circuit layout does not reflect the
datapath nature of the crossbar.
Multiplexers and logic gates are not bidirectional. Consequently, the synthesized version
of the crossbars requires an additional input pin to select direction of dataflow. The 4x4
crossbar requires eight address bits to program all its connections. The 8x8 crossbar needs
24 address bits, and the 16x16 crossbar needs 64 address bits. The number of programming
bits does not grow as rapidly with crossbar size as it does for the simple mesh.
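The address-bit counts quoted above follow from one select word of log2(N) bits per output line. The Python sketch below reproduces them; the function names are hypothetical, and the simple mesh figure assumes one configuration bit per crosspoint switch, which is an assumption made here for comparison.

```python
def mux_address_bits(n):
    """Address bits for an n x n multiplexer-based crossbar.

    Each of the n outputs has its own select word of log2(n) bits.
    n is assumed to be a power of two.
    """
    bits_per_select = n.bit_length() - 1  # log2(n) for powers of two
    return n * bits_per_select


def mesh_config_bits(n):
    """Configuration bits for an n x n simple mesh.

    Assumption for this sketch: one bit per crosspoint switch.
    """
    return n * n


for n in (4, 8, 16):
    print(f"{n}x{n}: multiplexer-based {mux_address_bits(n)} bits, "
          f"simple mesh {mesh_config_bits(n)} bits")
```

Under these assumptions the multiplexer-based count grows as N*log2(N), while the simple mesh count grows as N squared.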
It must be noted that the synthesized crossbar is less flexible than the simple mesh
approach in terms of direction of dataflow. The simple mesh is truly bidirectional, because
every line can be used as an input or an output independently of other lines. With the
current multiplexer-based design, direction of dataflow is set for all connections in the
crossbar with one signal. Consequently, if d0 is being used as an input, all d lines are being
used as inputs, and all xb lines are being used as outputs.
CHAPTER 5: THE BENES NETWORK
5.1 General circuit description
The standard Benes network is described in [1] and [2]. The architecture is based on
butterfly switches. A butterfly switch has two data inputs, two data outputs, and one control
signal. Figure 5.1 illustrates the behavior of a butterfly switch. When S is zero, In0 is
connected to Out0, and In1 is connected to Out1. When S is one, In0 is connected to Out1,
and In1 is connected to Out0.
Figure 5.1: Behavior of the butterfly switch.
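The behavior of the butterfly switch can be written as a small Python function. This is a behavioral sketch only; the name butterfly and the tuple return convention are choices made for this illustration.

```python
def butterfly(s, in0, in1):
    """Butterfly switch: returns (out0, out1) for control signal s.

    s == 0: straight through (In0 -> Out0, In1 -> Out1)
    s == 1: crossed          (In0 -> Out1, In1 -> Out0)
    """
    if s == 0:
        return in0, in1
    return in1, in0


# Print both rows of the behavior table.
for s in (0, 1):
    out0, out1 = butterfly(s, "In0", "In1")
    print(f"S={s}: Out0={out0}, Out1={out1}")
```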
Using butterfly switches, it is possible to construct non-blocking networks of virtually
any size. Figure 5.2 illustrates a 4x4 crossbar Benes network. Each block labeled with a
“B” represents a butterfly switch. Each butterfly switch is controlled by an independent
control signal. These control signals have been removed from figure 5.2 for clarity.
Figure 5.2: 4x4 Benes network.
Unlike the simple mesh, or the synthesized crossbar, building a 4x4 Benes network
does not provide an intuitive notion of how to build larger crossbars. In general, given a
crossbar size, there are multiple non-blocking Benes networks. The specific architectures
considered in this project use figure 5.2 as their basic building block. Figure 5.3 shows the
8x8 Benes architecture, while figure 5.4 shows the 16x16 version. These are known as
perfect shuffle networks [9].
Figure 5.3: 8x8 Benes network.
Figure 5.4: 16x16 Benes network.
Layout for the Benes networks was drawn starting with a custom butterfly switch. These
switches were then tiled and connected as suggested by figures 5.2, 5.3, and 5.4. It is
possible to make a more compact layout by sharing wells and some transistors that become
redundant on larger crossbars. As pointed out in previous chapters, layout area depends
partly on subjective decisions made by the circuit designer. Throughout this project, a
conscious effort has been made to provide conservative area estimations. In other words,
area estimates are meant to be worst case values for any 0.6 micron fabrication process.
Figure 5.5 depicts layout for the 4x4 Benes network.
Figure 5.5: Layout for the 4x4 Benes network.
5.2 Transistor Count and Area
A full pass gate implementation of a butterfly switch is shown in figure 5.6. Note that
such a circuit requires both the true and complement forms of the control signal.
Alternatively, NMOS switches may be used in place of the full pass gates.
Figure 5.6: Butterfly switch using full pass gates.
As Benes networks grow larger, any given signal must go through more butterfly
switches before reaching an output. In particular, every time crossbar size doubles, two
butterfly switches are added to every possible I/O path. Too many pass gates or pass
transistors in a signal path may lead to degraded signal integrity, slower rise and fall times,
and decreased throughput, making the circuit unsuitable for many applications and
environments. In an effort to alleviate these concerns, a buffer has been added to the outputs
of each butterfly switch. Table 5.1 summarizes transistor count and layout area for the three
crossbar sizes built using Benes networks. Notice that transistor count for the NMOS switch
version is 60% of the transistor count for the equivalent full pass gate circuit. A full-pass-
gate butterfly switch consists of ten transistors. The NMOS butterfly switch uses four pass
transistors and an inverter, for a total of six transistors.
Table 5.1: Transistor Count and Area
                 Single NMOS switch                Full pass gate switch
Crossbar size    Transistor count   Area (μm²)     Transistor count   Area (μm²)
4x4              36                 11,483         60                 23,011
8x8              120                42,472         200                91,882
16x16            336                103,518        560                207,926
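The transistor counts in table 5.1 can be reproduced from the structure of the network. The Python sketch below assumes the Benes arrangement used in this chapter, with N/2 rows of butterfly switches and 2*log2(N) - 1 columns, and uses the per-switch counts of six and ten transistors given above; the function names are hypothetical.

```python
def benes_switch_count(n):
    """Number of butterfly switches in an n x n Benes network.

    Assumes n is a power of two, n // 2 switch rows, and
    2 * log2(n) - 1 switch columns.
    """
    columns = 2 * (n.bit_length() - 1) - 1
    rows = n // 2
    return rows * columns


def benes_transistor_count(n, transistors_per_switch):
    """Transistor count as switch count times transistors per switch."""
    return benes_switch_count(n) * transistors_per_switch


for n in (4, 8, 16):
    print(f"{n}x{n}: {benes_switch_count(n)} switches, "
          f"NMOS {benes_transistor_count(n, 6)}, "
          f"full pass gate {benes_transistor_count(n, 10)}")
```

Doubling the crossbar size doubles the number of rows but adds only two columns, which is why the count grows more slowly than a factor of four.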
In terms of measured layout area, the full-pass-gate Benes network is roughly twice as
large as its NMOS counterpart. As with the previous architectures, the most useful aspect of
measuring layout area is looking at scaling trends. Figure 5.7 summarizes area information
from table 5.1.
It is interesting to observe that layout area for the 8x8 Benes network is about four times
as large as that of the 4x4 circuit. However, the 16x16 Benes crossbar is less than three
times as large as the 8x8 network. Recalling figures 5.3 and 5.4, note that doubling
crossbar size effectively doubles the number of butterfly switch rows. In contrast, only two
columns are added every time Benes network size doubles.
[Plot: layout area in square microns (0 to 250,000) versus crossbar size (4x4, 8x8, 16x16); series: Full pass gate, NMOS switch.]
Figure 5.7: Scaling trend for Benes network.
5.3 Delay estimations
Tables 5.2 and 5.3 summarize average delays for the Benes networks for both AMI and
ULP transistors. As explained in chapter 3, using only NMOS switches causes a noticeable
discrepancy between rising-transition delay and falling-transition delay.
Table 5.2: Delay summary for Benes network crossbar using NMOS transistors
as switches (delays in nanoseconds).
         AMI                           ULP
Size     Rising    Falling   Avg.      Rising    Falling   Avg.
4x4      2.40      3.24      2.82      3.80      4.47      4.14
8x8      4.45      6.71      5.58      8.46      11.37     9.92
16x16    6.23      8.77      7.50      12.72     17.38     15.05
Table 5.3: Delay summary for Benes network crossbar using full pass gates as
switches (delays in nanoseconds).
         AMI                           ULP
Size     Rising    Falling   Avg.      Rising    Falling   Avg.
4x4      2.07      2.13      2.1       3.1       3.62      3.36
8x8      4.54      4.71      4.63      6.96      7.23      7.1
16x16    6.29      6.75      6.52      9.04      11.17     10.11
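The average delays in tables 5.2 and 5.3 can be compared in absolute terms as well as in percentages. The Python sketch below copies the average columns from the two tables and computes the increase in nanoseconds each time crossbar size doubles; the names avg_delay_ns and per_doubling_increase are hypothetical.

```python
# Average delays in nanoseconds for the 4x4, 8x8 and 16x16 crossbars,
# copied from tables 5.2 (NMOS switches) and 5.3 (full pass gates).
avg_delay_ns = {
    ("NMOS", "AMI"): [2.82, 5.58, 7.50],
    ("NMOS", "ULP"): [4.14, 9.92, 15.05],
    ("full pass gate", "AMI"): [2.10, 4.63, 6.52],
    ("full pass gate", "ULP"): [3.36, 7.10, 10.11],
}


def per_doubling_increase(delays):
    """Absolute delay increase (ns) between consecutive crossbar sizes."""
    return [later - earlier for earlier, later in zip(delays, delays[1:])]


for (switch, process), delays in avg_delay_ns.items():
    steps = ", ".join(f"{step:.2f}" for step in per_doubling_increase(delays))
    print(f"{switch}, {process}: +{steps} ns per doubling")
```

For each circuit the two increments are of similar magnitude, which is what a roughly constant delay added per doubling would produce.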
Figures 5.8 and 5.9 illustrate the behavior of average delay as crossbars become larger.
As observed for the other crossbar architectures, the AMI process is generally faster than the
version using ULP transistors. Again, the circuit using full pass gates is faster than its
NMOS-switch counterpart, but this time the difference is not as dramatic as it was for the
simple mesh. These data seem to indicate that the additional buffers included at the switch
outputs have a greater influence on delay than the switches themselves. It may be
worthwhile to consider a version of the Benes crossbars with fewer output buffers, but such
a design falls outside the scope of this project.
[Plot: average delay in nanoseconds (0 to 16) versus crossbar size (4x4, 8x8, 16x16), crossbar based on NMOS switches; series: AMI, ULP.]
Figure 5.8: Average delay for the pass transistor version of the Benes crossbar
[Plot: average delay in nanoseconds (0 to 12) versus crossbar size (4x4, 8x8, 16x16), crossbar based on full pass gates; series: AMI, ULP.]
Figure 5.9: Average delay for the full pass gate version of the Benes crossbar
Although three points do not provide enough information to extrapolate a best fit curve,
tables 5.2 and 5.3 show that average delay roughly doubles for the 8x8 crossbar with respect
to the 4x4 crossbar (an increase of about 100% to 140%), but it only increases by
approximately 35% to 50% between the 8x8 crossbar and the 16x16 crossbar. Such behavior is
remarkably similar to that of the simple mesh crossbar, which seems counterintuitive. I/O
paths in the simple mesh always contained only one switch, regardless of crossbar size. On
the other hand, in a Benes network, the number of switches on any given I/O path increases
with crossbar size.
Let us consider absolute delay increase, rather than percentage increase. For instance,
Table 5.2 shows that doubling crossbar size, in the ULP simulations for the NMOS switch
version of the Benes crossbar, results in an average delay increase of roughly five
nanoseconds. This observation holds true whether the shift is from a 4x4 network to an 8x8,
or from an 8x8 to a 16x16 crossbar. For the AMI simulations of the same circuit, the
absolute increase in average delay, for doubling crossbar size, is about two nanoseconds.
The same tendency is evident in the full pass gate circuits, where delay increases by about
two nanoseconds for the AMI process, and by about three nanoseconds for the ULP simulations.
The fact that absolute delay increases linearly, while crossbar size is doubled, is
consistent with the observation that exactly two butterfly switches are being added to each
I/O path.
When analyzing the simple mesh crossbar, it was noted that the fact that delay increased
with increasing crossbar size was primarily due to increased parasitic capacitance. In a
Benes network that is not the case. Since crossbar size affects the number of switches that a
signal must go through, the effective resistance of the path is a function of crossbar size.
Furthermore, the number of diffusion capacitances that must be charged varies for different
crossbar sizes. As a result, variations in RC delay are not determined solely by parasitic
capacitances, as was the case with the simple mesh. In fact, average parasitic capacitances
extracted from the layout are smaller for the Benes crossbar than they were for the simple
mesh. Routing for the simple mesh leads naturally towards long stretches of wires running
parallel to each other. In a Benes network, metal lines must be interleaved to make the
network a fully-connected, non-blocking system. In general, sections of wires running
parallel to each other are shorter in a Benes network than in the simple mesh. Also, to
accommodate the more convoluted patterns, wires are spaced farther apart in the Benes
network than in the simple mesh, further decreasing capacitance. Table 5.4 presents average
capacitance for the Benes crossbar.
Table 5.4: Average parasitic capacitances for Benes crossbar
Crossbar size    Average extracted capacitance
4x4              10.83 fF
8x8              15.3 fF
16x16            19.67 fF
For the fabrication processes presented here, wire resistance is negligible compared to the
dynamic resistance of transistors. In fact, fabrication process models developed by the
foundries themselves neglect the effects of wire resistance. In an effort to recognize the
increasing role of interconnect delay in overall circuit performance, Table 5.5 lists the
longest paths and their respective wire lengths. It is noteworthy that routing for
Benes networks requires signals to traverse several vias, whose resistance might be far
larger than that of wires. Resistance of vias and contacts is neglected entirely by the CAD
tools used for this project. Table 5.5 includes the number of vias on each of the longest
paths. For all Benes crossbars, the longest I/O paths, that go through the most vias, lead
from In0 to Out0. Recall that there are several ways to reach Out0 from In0 in a Benes
crossbar. The worst case path is the one where the signal travels from the top row of
butterfly switches to the bottom row of switches, and then back again to the top.
Table 5.5: Worst case wire lengths
Crossbar size    Longest path       Length of longest path    Number of vias
4x4              In0-Out0 (alt.)    482.1 μm                  4
8x8              In0-Out0 (alt.)    912.1 μm                  12
16x16            In0-Out0 (alt.)    1612.8 μm                 28
5.4 Power consumption
Tables 5.6 and 5.7 summarize leakage current, average static power consumption, and
Ipeak, for all the circuits being compared.
Table 5.6: Power consumption summary for Benes crossbar using NMOS
transistors as switches. Currents in μA, power in milliwatts.
         AMI                              ULP
Size     Ileak   Ipeak    Static power    Ileak    Ipeak    Static power
4x4      0       971      0               9.9      106.6    0.00495
8x8      0       1,705    0.00191         27.9     221.2    0.01395
16x16    0       14,990   0.00356         71.8     476      0.03590
Table 5.7: Power consumption summary for Benes crossbar using full pass gates
as switches. Currents in μA, power in milliwatts.
         AMI                              ULP
Size     Ileak   Ipeak    Static power    Ileak    Ipeak    Static power
4x4      0       993      0               9.96     106.4    0.00498
8x8      0       1,810    0               14.1     221.3    0.00705
16x16    0       15,370   0               39.7     476.2    0.01985
NMOS implementations have higher leakage than their full pass gate counterparts, and
AMI transistors show virtually no leakage current for either switch implementation. As
explained in chapter 3, static power consumption in NMOS-switch circuits using AMI
transistors is primarily due to degraded voltage levels, which prevent transistors from fully
turning on or off.
Ipeak, on the other hand, remains essentially constant, regardless of the type of switch used
to build the Benes crossbar. This suggests that dynamic power consumption is also
relatively independent of the selected switch. Recalling equation 2.1,
$P = C \cdot f \cdot V_{dd}^{2} \cdot u$    (2.1)
it follows that the only term affected by the choice of pass switch is capacitance. The load
seen by all crossbar outputs was set to 0.1 picofarads, or 100 femtofarads, as described in
section 2.7. Table 5.4 shows that parasitic capacitances for the Benes crossbar are on the
order of twenty femtofarads. These two parameters remain unaffected by the selection of a
switch. Additionally, all switches drive output buffers. All of these output buffers have the
same gate capacitance, independent of whether the butterfly switch uses a pass gate or a pass
transistor. The only other significant capacitance on the datapath is the diffusion
capacitance of the switch transistors. The data for Ipeak in tables 5.6 and 5.7 suggest that
this capacitance is small compared to the other contributions mentioned above. For the
Benes crossbar layout, diffusion capacitances reported by Smartspice ® are always less than
one femtofarad, which is consistent with the Ipeak data reported in tables 5.6 and 5.7.
In agreement with the data from chapter 3, and the nature of ULP, AMI simulations show
much higher values for Ipeak than ULP simulations. Figure 5.10 shows how power
consumption varies with frequency for the 4x4 Benes crossbar built using NMOS transistors
as switches. From the intersection of the AMI and ULP curves, it follows that ULP will
consume less power than AMI as long as inputs toggle at a rate that is faster than 1.2 MHz.
Similar analyses prove that ULP is the more power efficient alternative for all Benes
crossbars operating at more than 2 MHz. Table 5.8 summarizes the minimum frequency at
which ULP becomes viable for the different circuits studied.
[Plot: total power consumption in milliwatts (0 to 0.01) versus frequency (0.1 to 1.9 MHz), 4x4 crossbar; series: AMI, ULP.]
Figure 5.10: Power consumption vs. frequency for the 4x4 Benes crossbar using NMOS pass transistors.
Table 5.8: Minimum frequency for ULP power savings (MHz)
Size     NMOS switch    Full pass gate
4x4      1.91           1.18
8x8      1.42           0.85
16x16    1.2            0.12
5.5 Other considerations:
Table 5.9 presents architectural efficiency for the Benes crossbar.
Table 5.9: Architectural efficiency for the Benes crossbar
Size     I/O connections   Transistor count   Delay elements per connection   Utilization   Arch. Eff.
4x4      16                60                 3                               1             0.222
8x8      64                200                5                               1             0.08
16x16    256               560                7                               1             0.041
For smaller circuits, it has been shown that Benes crossbars are larger, and more
complex, than the alternatives offered in this study. Yet, among the options studied, Benes
networks have the advantage of scaling less dramatically than the simple mesh or the
multiplexer-based crossbar. This means that, for larger crossbars, Benes networks will take
up less area than other crossbar implementations.
When crossbar size is increased, the number of configuration signals necessary to
program the crossbar also increases. As with area, the number of configuration bits scales at
a lower rate for the Benes network than for other crossbar circuits. This advantage might be
exploited in two ways. One possibility is to say that larger Benes crossbars require fewer
programming pins than other crossbar circuits of the same size. Alternatively, given a
specific programming interface, with a fixed set of I/O pins, larger Benes networks require
fewer data transfers on any given programming pass.
A reduced number of programming signals does come at a price. Benes networks are not
intuitively programmable, like the simple mesh or the multiplexer crossbar. Every time an
I/O connection is established in a Benes crossbar, it affects other possible routing paths
within the crossbar. As a result, finding the appropriate configuration that realizes all
intended I/O connections without blocking is not a trivial problem. Thus, of the three
alternatives being compared, the Benes network is the only one that requires software
support in its programming. This becomes an issue for systems that require real time
reconfiguration. For systems where configuration and execution are done in separate steps,
it is viable to compute the valid configurations ahead of time, store them in some kind of
memory, and then retrieve the configurations as needed. To aid in configuring Benes
crossbars, figure 5.11 presents an algorithm that takes a set of required I/O connections, and
finds a valid configuration that achieves all connections simultaneously.
Figure 5.11: Algorithm for configuring Benes networks.
Configurations for Benes crossbars can be easily visualized by organizing them into a
binary matrix format. Rows and columns in the matrix correspond to rows and columns in
the Benes crossbar. Each entry in the matrix contains the value of the control signal for a
particular butterfly switch, which is identified by its row and column number. Figure 5.12
presents a 4x4 Benes crossbar and its corresponding configuration matrix.
Figure 5.12: Benes crossbar and configuration matrix
Using this matrix notation, it is possible to define any set of connections for a particular
crossbar. For example, the two alternative paths that connect X0 to Y0 may be represented
as:
\[
\begin{bmatrix} 0 & 0 & 0 \\ x & x & x \end{bmatrix}
\quad \text{and} \quad
\begin{bmatrix} 1 & x & 1 \\ x & 0 & x \end{bmatrix}
\]
Note that the control signals marked with the symbol “x” do not affect the connection
between X0 and Y0. Given a set of required I/O connections, it is possible to build an
exhaustive list of configuration matrices that achieve each individual connection. In the
case of a 4x4 crossbar, there are two alternative paths for each I/O pair, and thus two
matrices are formed for each required connection. In general, for an NxN Benes network,
there are exactly N/2 alternative ways of establishing each individual connection.
The goal of programming a Benes network is achieving all the specified connections.
Thus, the algorithm in figure 5.11 assumes that a set of I/O pairs has been defined. On a
fully programmed NxN Benes crossbar, exactly N I/O connections are specified.
It is possible to build a lookup table containing all the alternative ways of establishing
each individual connection. Because each of these N I/O connections can be completed in
N/2 alternative ways, the lookup table resembles an NxN/2 matrix, except that each element
in the lookup table is itself a configuration matrix, rather than just a binary value. Each row
in the lookup table identifies a required connection. The columns in the table represent
alternative paths that establish the connection. In other words, each entry in the lookup table
represents a path that, if taken, would provide one of the required connections.
The key to programming a Benes network is finding one configuration matrix that
achieves all required connections simultaneously. This can only be accomplished by
choosing a set of non-conflicting paths for all connections. Two paths are said to be
conflicting if they require any particular butterfly switch to be in both states at the same
time. The algorithm from figure 5.11 essentially traverses the lookup table looking for non-
conflicting paths, until it finds a valid configuration matrix. The algorithm uses two row
pointers and two column pointers. ROW1 and COL1 point to the first path that has been
incorporated into the configuration matrix. ROW2 and COL2 define the latest path being
tested for conflicts with the configuration matrix. The configuration matrix is labeled as
COMPARE in the algorithm.
A match occurs when a new path is found that allows an intended connection without
conflicting with the previously chosen paths. When there is a match, a new configuration
matrix, which incorporates the matching path, must be generated. The new configuration
matrix is the result of intersecting the previous configuration matrix and the new path. Paths
are specified by configuration matrices, whose dimensions are determined by the size of the
Benes crossbar. Intersecting two paths consists of comparing all the corresponding entries
in both configuration matrices. If a particular entry is the same in both matrices, the
resulting matrix keeps the same value for that entry. If one of the two values for a particular
entry is an “x”, but the other one is a one or a zero, the resulting intersection takes the one or
the zero as the value for that entry. If two paths have complementary values on a particular
entry, there is a conflict, and therefore no match has occurred. Equation 5.1 gives an
example of a configuration matrix intersection.
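\[
\begin{bmatrix} 0 & 0 & 0 \\ x & x & x \end{bmatrix}
\cap
\begin{bmatrix} 0 & x & x \\ x & 1 & 1 \end{bmatrix}
=
\begin{bmatrix} 0 & 0 & 0 \\ x & 1 & 1 \end{bmatrix}
\qquad (5.1)
\]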
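The intersection rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this study: the function name `intersect` and the use of the string "x" for a don't-care entry are assumptions, and `other_path` is a hypothetical third path chosen only to show a successful match.

```python
X = "x"  # don't-care marker, following the notation in the text

def intersect(a, b):
    """Intersect two configuration matrices; return None on a conflict."""
    result = []
    for row_a, row_b in zip(a, b):
        row = []
        for va, vb in zip(row_a, row_b):
            if va == X:
                row.append(vb)      # the other matrix decides (may also be x)
            elif vb == X or va == vb:
                row.append(va)      # same value, or only this one is defined
            else:
                return None         # complementary values: conflict, no match
        result.append(row)
    return result

upper_x0_y0 = [[0, 0, 0], [X, X, X]]   # X0 to Y0 through the upper middle switch
lower_x0_y0 = [[1, X, 1], [X, 0, X]]   # X0 to Y0 through the lower middle switch
other_path = [[0, X, X], [X, 1, 1]]    # hypothetical path for a second connection

print(intersect(upper_x0_y0, other_path))   # [[0, 0, 0], ['x', 1, 1]]
print(intersect(upper_x0_y0, lower_x0_y0))  # None
```

Returning `None` for a conflict keeps the "no match" case distinct from a matrix that still contains don't-care entries.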
Every time a match is found, the current COMPARE matrix, and all the pointers are
pushed onto a stack, before computing the new COMPARE matrix. If COL2 ever reaches
the end of a row (COL2=N/2), it means that a required connection has not been
accomplished. Since all the paths for the latest connection have been tested unsuccessfully,
the only way to find a valid configuration is to backtrack. The algorithm pops old values of
COMPARE and the pointers, increments COL2, and starts looking for new non-conflicting
paths. MCOUNT keeps track of how many intersections have been performed to reach the
current COMPARE matrix. If MCOUNT falls back to zero, it means that the initial path
chosen is creating the conflicts. In that case, COL1 is incremented, so that an alternative
way of achieving the first required connection can be explored. Once ROW2 reaches the
end of the lookup table (ROW2=N), all required connections have been successfully
performed. The resulting configuration matrix is stored in COMPARE.
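The search described above can be approximated by a short depth-first search with backtracking. The sketch below is not a transcription of figure 5.11: it uses recursion in place of the explicit stack, ROW/COL pointers, and MCOUNT, and it stores each path as a dictionary of switch states rather than a matrix with "x" entries. The wiring convention in `paths_4x4` (state 0 is straight, state 1 is crossed, output port q of a switch leads to row q of the next column) is an assumption, and all function names are hypothetical.

```python
def paths_4x4(i, j):
    """Both paths from input i to output j as {(row, col): state} dicts.

    Assumed wiring: state 0 = straight, 1 = crossed; output port q of a
    switch feeds the switch in row q of the next column.
    """
    return [{(i // 2, 0): (i % 2) ^ m,      # steer the input toward middle switch m
             (m, 1): (i // 2) ^ (j // 2),   # steer toward the output-side switch
             (j // 2, 2): m ^ (j % 2)}      # select the required output port
            for m in (0, 1)]

def merge(config, path):
    """Intersect a partial configuration with a path; None on a conflict."""
    for switch, state in path.items():
        if config.get(switch, state) != state:
            return None
    return {**config, **path}

def configure(lookup, row=0, config=None):
    """Depth-first search with backtracking over the rows of the lookup table."""
    config = {} if config is None else config
    if row == len(lookup):          # every required connection has a path
        return config
    for path in lookup[row]:        # try each alternative path in this row
        merged = merge(config, path)
        if merged is not None:
            result = configure(lookup, row + 1, merged)
            if result is not None:
                return result
    return None                     # all alternatives conflict: backtrack

permutation = {0: 2, 1: 0, 2: 3, 3: 1}   # input -> output, example only
lookup = [paths_4x4(i, j) for i, j in permutation.items()]
print(configure(lookup))
```

With this wiring assumption, `paths_4x4(0, 0)` yields the same two X0 to Y0 paths given earlier in matrix notation. For larger networks the lookup table and the search both grow quickly, which is the execution-time concern raised in the next paragraph.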
Although this programming procedure is far more involved than configuring a simple
mesh or a multiplexer-based crossbar, it is actually a rather simple software problem. The
real drawback is not the algorithm’s complexity, but its execution time. Larger lookup
tables take up more memory, and it takes longer to match all the different entries. The
viability of Benes networks as VLSI crossbars rests largely on whether the proposed
application can accommodate the various consequences of this additional programming step,
which is not necessary with other crossbar architectures.
CHAPTER 6: ARCHITECTURE COMPARISON
6.1 Area and scalability
Chapter 3 has pointed out several shortcomings of pass transistor implementations, in
terms of their robustness and reliability. Thus, comparisons across different architectures
will focus on circuits that use full pass gates as switches.
Figure 6.1 plots transistor count for the three crossbar architectures being analyzed.
Figure 6.2 provides a similar comparison of crossbar layout area across architectures. For
all the crossbars studied, the simple mesh architecture resulted in the smallest layout area.
Yet, two observations suggest that for larger crossbars the Benes network might be the more
efficient alternative in terms of real estate. First of all, when full pass gates are used as
switches, transistor count for the Benes network is lower than that for the simple mesh.
Moreover, layout area grows more rapidly with crossbar size for the simple mesh than for
the other two approaches. In fact, the rate of increase in layout area with crossbar size is
smallest for the Benes network.
Figure 6.1: Transistor count vs. Crossbar size
[Plot: transistor count (0 to 6000) vs. crossbar size (4x4, 8x8, 16x16); series: simple mesh, MUX based, Benes.]
Figure 6.2: Layout area vs. Crossbar size
[Plot: layout area in square microns (0 to 300000) vs. crossbar size (4x4, 8x8, 16x16); series: simple mesh, MUX based, Benes.]
It is also interesting to note that although the difference in transistor count suggests that
the multiplexer-based crossbar should be much larger than the alternative architectures,
actual layout area shows that is not the case. Layout for the multiplexer-based crossbar is
being generated by a place and route tool. Such CAD tools are able to achieve much higher
logic densities than many human layout designers. Crossbar generators can produce layout
that is very area efficient for the simple mesh architecture, but such tools are expensive, and
currently unavailable at the University of Idaho.
6.2 Throughput
Figures 6.3 and 6.4 summarize delay measurements observed in this study. As expected,
ULP transistors are slower than their AMI counterparts. More importantly, results show that
the simple mesh crossbars are faster than the alternative architectures. This is also
consistent with theoretical expectations, considering that the only delay element in the
simple mesh datapath is a single pass gate. The comparison between the multiplexer-based
crossbar and the Benes network is less predictable, as it depends on the relative delay
between the multiplexers and logic gates that make up the synthesized circuit, in relation to
the delay of the pass gates and buffers that make up the Benes crossbar.
From the delay data obtained using AMI transistor models, Benes crossbars seem about
two times slower than the corresponding synthesized crossbars. However, for the same
simulations and netlists, the data gathered using ULP transistor models indicate that the
difference in delay is much smaller. In the case of the 8x8 crossbar, average delay is smaller
for the Benes network than for the multiplexer-based crossbar, when only ULP data are
considered.
Recall that transistors were sized for equal rise and fall times considering AMI models.
When the AMI models are substituted with ULP models without resizing the transistors, rise
and fall times may differ significantly. Buffers for the Benes crossbars were built using two
cascaded inverters. Just like the rest of the circuit, these inverters were sized for equal rise
and fall times according to AMI data.
Delay is measured at the point where voltages are halfway through their full swing. Note
that, because of the buffers, the Benes network effectively consists of a long chain of
inverters from input to output. Due to the mismatches in rise and fall times for all of these
inverters, measuring delay at the halfway point sometimes yielded negative delays for rising
transitions. The net result of including these negative values into the average is a smaller
figure for average delay.
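To make the measurement artifact concrete, the sketch below computes a 50% point delay from sampled waveforms. The waveforms are hypothetical and are not taken from the simulations in this study; they model one way a negative figure can arise, namely a slow input edge driving a stage whose output switches quickly, before the input has reached half of its swing. The function names are assumptions.

```python
def crossing_time(times, volts, level):
    """Linearly interpolate the first time the waveform crosses `level`."""
    for k in range(1, len(times)):
        v0, v1 = volts[k - 1], volts[k]
        if (v0 - level) * (v1 - level) <= 0 and v0 != v1:
            frac = (level - v0) / (v1 - v0)
            return times[k - 1] + frac * (times[k] - times[k - 1])
    raise ValueError("waveform never crosses the requested level")

def delay_50(t_in, v_in, t_out, v_out, vdd):
    """Delay between the 50% points of the input and output waveforms."""
    half = vdd / 2
    return crossing_time(t_out, v_out, half) - crossing_time(t_in, v_in, half)

# Hypothetical waveforms: time in ns, voltage normalized to vdd = 1.0.
t_in, v_in = [0.0, 10.0], [0.0, 1.0]              # slow input ramp, 50% at 5 ns
t_out, v_out = [0.0, 2.5, 3.5, 10.0], [0.0, 0.0, 1.0, 1.0]  # fast edge, 50% at 3 ns

print(delay_50(t_in, v_in, t_out, v_out, vdd=1.0))  # -2.0
```

The output clearly responds to the input, yet the measured delay is negative, which is why averaging such values understates the true delay.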
Note that this behavior does not mean that Benes networks are “faster” for ULP
processes. Simply, delay is measured at an arbitrary point in the signal, and the specific
characteristics of a certain waveform may produce misleading results. In summary, the
average delays reported for the Benes network using ULP transistors should be considered
inaccurate. They are only presented here for completeness. Section 6.8 discusses
alternatives for obtaining more useful data for comparison. Overall, Benes networks are
slower than the other two crossbar architectures.
Figure 6.3: Signal delay for crossbars using AMI transistor models.
[Plot: delay in nanoseconds (0 to 7) vs. crossbar size (4x4, 8x8, 16x16); series: simple mesh, MUX based, Benes.]
Figure 6.4: Signal delay for crossbars using ULP transistor models.
[Plot: delay in nanoseconds (0 to 12) vs. crossbar size (4x4, 8x8, 16x16); series: simple mesh, MUX based, Benes.]
Dynamic simulation results indicate that Benes crossbars are much slower than other
architectures, as long as gate delay dominates over interconnect delay. To understand what
happens in deep submicron technologies, where interconnect delay should not be neglected,
figure 6.5 summarizes worst case wire length for the different crossbars.
Figure 6.5: Worst case wire length.
[Plot: wire length in microns (0 to 1800) vs. crossbar size (4x4, 8x8, 16x16); series: simple mesh, MUX based, Benes.]
Again, the Benes network proves to be the slowest architecture. Worst case wire lengths
are far greater for Benes networks than for other architectures. Furthermore, the convoluted
routing of Benes crossbars forces signals to travel through more vias than the simple mesh,
or the multiplexer-based crossbar. As a result, Benes networks will remain the slowest
crossbar alternative, even in deep submicron technologies. Notice however what happens
for the multiplexer-based crossbar. Because this is, essentially, a combinational circuit that
adds one level of logic every time crossbar size doubles, the physical distance between
inputs and outputs only increases as much as is necessary to accommodate the additional
level of logic. As the paradigm shifts from gate dominated delay to interconnect dominated
delay, the combinational approach to crossbars might yield throughputs comparable to those
of the simple mesh.
6.3 Power consumption
One of the main issues driving the interest in Benes crossbars for VLSI is their limited
static power consumption. There is concern that highly parallel structures, such as simple
mesh crossbars, might yield unacceptable static power consumption, due to high
leakage currents. Using full pass gates, rather than NMOS switches, essentially eliminates
all static power consumption from AMI crossbars. Figure 6.6 illustrates static power
consumption for ULP crossbars using full pass gates.
Figure 6.6: Static power consumption for crossbars using ULP transistor models.
[Plot: power consumption in milliwatts (0 to 0.2) vs. crossbar size (4x4, 8x8, 16x16); series: simple mesh, MUX based, Benes.]
For the crossbar circuits simulated, as long as all inputs toggled at an average rate of at
least 3 MHz, ULP simulations yielded power savings, when compared with AMI
simulations. Thus, static power consumption and leakage currents should only be cause for
concern where the average toggling rate of the inputs is less than 3 MHz.
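The break-even behavior can be illustrated with a first-order power model. The sketch below assumes that total power is a constant static term plus a dynamic term proportional to toggling frequency, and that the AMI circuit has negligible static power, as stated above for full pass gates. The function name and all numeric values are hypothetical; they were chosen only to illustrate the calculation and are not simulation results from this study.

```python
def breakeven_frequency(p_static_ulp, slope_ami, slope_ulp, p_static_ami=0.0):
    """Frequency (MHz) above which the ULP circuit uses less total power.

    Assumed model: P(f) = P_static + slope * f, with power in mW and
    slope in mW per MHz.
    """
    if slope_ami <= slope_ulp:
        raise ValueError("ULP never wins: its dynamic slope is not smaller")
    return (p_static_ulp - p_static_ami) / (slope_ami - slope_ulp)

# Hypothetical values, for illustration only.
f_star = breakeven_frequency(p_static_ulp=0.15, slope_ami=0.06, slope_ulp=0.01)
print(f"{f_star:.1f} MHz")  # 3.0 MHz
```

Below the break-even frequency the static term dominates and the ULP circuit consumes more power than its AMI counterpart; above it, the smaller dynamic slope wins.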
6.4 Architectural efficiency and programming complexity
Architectural efficiency attempts to quantify the relationship between the flexibility
offered by a particular crossbar architecture, and the amount of hardware needed to achieve
that flexibility. A higher architectural efficiency means that a given architecture provides
more alternative connections, using fewer switches (or multiplexers), than another
architecture with a lower efficiency. Architectural efficiency is a very useful figure of merit
in comparing routing schemes for computer networks. In the case of
VLSI crossbars, it is important to remember that architectural efficiency does not account
for all the electrical considerations discussed in the previous sections of this chapter. Figure
6.7 graphs architectural efficiency for the circuit alternatives considered in this study.
Figure 6.7: Architectural efficiency for different crossbars.
[Plot: architectural efficiency (0 to 0.25) vs. crossbar size (4x4, 8x8, 16x16); series: simple mesh, MUX based, Benes.]
Notice that architectural efficiency decreases with increasing crossbar size. This trend is
consistent with the observation that, although small crossbars are attractive circuits for their
simplicity and versatility, adding crossbar inputs often results in very large circuits, which
require programming too many switches. The Benes network has the highest architectural
efficiency precisely because it uses the fewest total number of switches to construct a fully-
connected, non-blocking crossbar. Furthermore, given a certain crossbar configuration, the
Benes network uses all switches to establish the required connections. In contrast, when
considering a simple mesh crossbar, for every established connection in an NxN crossbar,
there are N-1 switches that must remain turned off to avoid driving conflicts. Because
multiplexers are not bidirectional, on the synthesized crossbar half the circuit is turned off at
any given time. Moreover, the multiplexer-based crossbar presented here has a global
direction signal, so it is not truly bidirectional, further decreasing its architectural efficiency.
Even though the Benes network is the more efficient architecture, its advantages are better
suited for computer networks than for hardware crossbars. For example, architectural
efficiency does not take into account the fact that programming a Benes network for a
specific set of connections is not a trivial task. The additional complexity of programming a
Benes network might not be a serious burden for a computer network. In such systems, there
is a versatile processor that can handle the routing algorithm, either on a server or on the
workstations themselves. In the case of embedded systems or ASICs, where computing
power is optimized for a particular application, accommodating a routing algorithm for the
crossbars might be unacceptable, or even infeasible. Moreover, in a computer network, the
hardware cost of idle switches, such as those found in the simple mesh, might make the
Benes network a more attractive alternative. Larger systems may be inclined to use Benes
networks because these scale less rapidly than the other crossbars, but the price to be paid is
lower throughput and increased programming complexity.
6.5 Application scenarios
Performance of a particular circuit is closely linked to its application. Likewise, the
choice of a certain circuit design is often dependent on the environment the circuit must
withstand, i.e. a radiation intensive environment, extreme temperatures, etc. Rather than
reaching a general conclusion about which crossbar architecture is better, this study looks
for the strengths and weaknesses of each architecture, and then outlines a context where
each crossbar design might be useful.
Overall, the simple mesh offers a straightforward, versatile, and fast architecture to build
a crossbar. For ULP transistors, simulation data have shown that static power consumption
is not a major concern for crossbars of size 16x16, or smaller, as long as inputs toggle at an
average rate of more than 3 MHz. Thus, the only major disadvantage of the simple mesh is
its scaling trend. For most crossbars larger than the ones examined in this project, the
simple mesh is likely to be the architecture that takes up the largest silicon area.
Additionally, the number of independent programming signals required for the simple mesh
grows at the same quadratic rate as layout area. In summary, the simple mesh is the ideal
architecture for small crossbars. For systems looking for power-consumption efficiency,
such as battery operated systems, whose most extreme subset includes space-borne
applications, it is important to estimate the activity rate of the crossbar signals. If signals
show low activity rates, the advantages of a ULP fabrication process might be
overshadowed by static power consumption. If that happens, it may be better to switch to a
less expensive fabrication process, or choose a different crossbar architecture. The simple
mesh is also the most robust and reliable of crossbar architectures. Because it has fewer
transistors, it is less likely that a defect in the silicon will affect circuit operation. By the
same token, it is less probable that a radiation particle will hit a critical transistor in an
architecture with such a low transistor density.
The Benes network can be the most energy efficient crossbar architecture, in applications
where signals have a low activity rate. Yet, if a crossbar has a low activity rate, it is likely
that its contribution to system power consumption is small, compared to the rest of the
circuit. Thus, Benes networks would only be recommended for very large crossbars, where
they would yield the smallest layout area. However, the price to be paid for the smaller
circuit is a significant increase in signal delay, which adversely affects throughput. In any
case, Benes networks are only viable in systems that can efficiently deal with the issue of
programming the Benes crossbar. In general, Benes architectures are best suited for
computer networks. In VLSI systems, Benes crossbars may find a niche in applications
requiring extremely large interconnection networks, but even then, such crossbars might
prove to be too slow for the intended application. Furthermore, the need to use more vias in
the routing of Benes networks is a strong point against the architecture’s reliability.
Electromigration, stress voiding, and even mechanical breaking of a contact are more likely
to occur in vias than in metal layers. The skin effect on current flow effectively increases
current density across contacts and vias. Adding more contacts alleviates this problem, but
it also leads to larger circuits, diminishing the advantage of using Benes networks for larger
crossbars.
Finally, the multiplexer-based crossbar is, electrically, the least efficient of all
crossbar architectures. Its circuits are large in area, and also have much higher transistor
counts. Because they are logically denser than other architectures, they will tend to be less
reliable. The multiplexer-based crossbar is also the least energy efficient architecture. For
current fabrication technologies, the multiplexer-based crossbar is slower than the simple
mesh. However, it should be pointed out that the synthesized crossbar also yielded the
shortest wire lengths for the datapath. As technology shifts to deep submicron, with a
dominance of interconnect delay over gate delay, a combinational crossbar could become
faster than a simple mesh. Aside from this speed conjecture, the major advantage of a
synthesized crossbar is that it has the lowest design time. Although a simple mesh is
extremely simple, it is still necessary to lay it out, or else use expensive crossbar generators.
Standard synthesizers and place and route tools generate multiplexer-based crossbars at a
fraction of the cost, using a very simple HDL description as their starting point. For this
reason, synthesized crossbars are the preferred alternative for proof of concept in a larger
system.
6.6 Conclusions and future work
• Overall, the simple mesh offers the most straightforward, versatile, and fastest
architecture to build a crossbar for VLSI systems. Its only drawback is the fact that the
circuit grows proportionally to N^2, for NxN crossbars.
• In general, Benes architectures are best suited for computer networks, where the issue of
network reconfiguration can be dealt with more efficiently in software.
• Combinational circuits that serve as crossbars might become the fastest alternative
for deep submicron technologies. For now, these synthesized crossbars belong in proof-
of-concept prototypes, where the focus is on functionally correct circuits with very short
design times.
• Static power consumption, for crossbars built with the ULP process, is only an issue for
circuits operating at frequencies below 10 MHz.
• Wire resistance models posed a severe limitation for this study. A more accurate study
can be developed if simulations include the effects of wire resistance. The effect of vias
should also be quantified as part of such a follow-up study.
• The effect of varying transistor sizes was not explored in this project. Although the
general conclusions provided here are probably still valid across different transistor
sizes, a more thorough comparison would include such variations.
• For a more formal comparison of crossbar architectures it would be important to
establish a set of benchmark tests, to be performed in simulation or on actual circuitry.
Although all crossbars in this study went through the exact same tests, there is no
guarantee that the chosen input stimuli are representative of real life crossbar
applications.
• Two variables that might affect crossbar behavior but were not explored thoroughly in
this project are temperature and operating frequency. To provide accurate quantitative
results, crossbars should be tested under a variety of conditions for these two variables.
• It would be interesting to reevaluate an alternative design of the Benes crossbar, using
fewer buffers between butterfly switches.
• A true reliability study can only be performed based on actual circuits, through
burnout experiments. Reliability observations in this report are merely hypotheses that
stem from theoretical notions and previous experience with circuits other than crossbars.
• A quantitative study requires a more thorough statistical analysis of all data. However,
the length and scope of such a study only make sense for actual, physical circuits.
Because the simulations examined in this project are crude approximations of reality, a
complete statistical analysis would not yield quantitatively accurate
information. Thus, this study focuses on qualitative assessments, supported by
theoretical background and simulation results.
• For a complete study on crossbar architectures and their applications, it would be
necessary to fabricate these circuits, and perform a statistical analysis over many copies
of every particular chip. Only then can quantitative data be considered reliable.
107
REFERENCES
[1] V.E. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic,
Academic Press, 1965.
[2] D. Bhatia and J. Haralambides, “Resource requirements for Field Programmable
Interconnection Chips”, IEEE Trans. VLSI Syst., Vol. 8, No. 3, pp. 346-355, June 2000.
[3] J.B. Burr, and A.M. Peerson, “Ultra Low Power CMOS Technology”, NASA Symposium
on VLSI Design, pp. 4.2.1-4.2.13, 1991.
[4] C-K. Cheng, J. Lillis, S. Lin, N. Chang, Interconnect analysis and synthesis, John Wiley
& Sons, Inc. 2000.
[5] P. Chow, S.O. Seo, J. Rose, K. Chung, G. Paez-Monzon, and I. Rahardja, “The design
of an SRAM-based field-programmable gate array-Part I: Architecture,” IEEE Trans. VLSI
Syst., Vol. 7, No.2, pp. 191-197, June 1999.
[6] P. Chow, S.O. Seo, J. Rose, K. Chung, G. Paez-Monzon, and I. Rahardja, “The design
of an SRAM-based field-programmable gate array-Part II: Circuit Design and Layout,”
IEEE Trans. VLSI Syst., Vol. 7, No. 3, pp. 321-330, September 1999.
[7] C. Clos, “A Study of Non-Blocking Switching Networks”, Bell System Tech Journal,
Vol. 32, pp. 406-424, March 1953.
[8] A. DeHon, "Reconfigurable Architectures for General-Purpose Computing",
Massachusetts Institute of Technology Artificial Intelligence Laboratory Report No. 1586,
sponsored under contract F30602-94-C-0252, October 1996.
108
[9] J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks: an Engineering
Approach, IEEE Computer Society Press, 1997.
[10] A. Ejnioui, and N. Ranganathan, “Multiterminal Net Routing for Partial Crossbar-
Based Multi-FPGA Systems”, IEEE Trans. VLSI Syst., Vol. 11, No. 1, pp. 71-78, February
2003.
[11] V.C. Gaudet, R.J. Gaudet, and P.G. Gulak, “Programmable Interleaver Design for
Analog Iterative Decoders”, IEEE Trans. on Circuits and Systems-II: Analog and Digital
Signal Processing, Vol. 49, No. 7, pp. 457-464, July 2002.
[12] G. Han, R.H. Klenke, and J.H. Aylor, “Performance Modeling of Hierarchical
Crossbar-Based Multicomputer Systems”, IEEE Trans. on Computers, Vol. 50, No. 9, pp.
877-890, September 2001.
[13] D. Harris. Skew-Tolerant Circuit Design. Morgan-Kaufmann Publishers. San
Francisco, California. 2001.
[14] K.J. Hass, J. Venbrux and P. Bhatia, “Logic Design Considerations for 0.5-Volt
CMOS”, 2001 Conference on Advanced Research in VLSI, March 14-16, 2001.
[15] M.A.S. Khalid, and J. Rose, “A Novel and Efficient Routing Architecture for Multi-
FPGA Systems”, IEEE Trans. VLSI Syst., Vol. 8, No. 1, pp. 30-39, February 2000.
[16] Y-T. Lau, and P-T. Wang, “Hierarchical Interconnection Structures for Field
Programmable Gate Arrays”, IEEE Trans. VLSI Systems, Vol. 5, No. 2, June 1977.
[17] J. Li and C-K. Cheng, “Routability Improvement Using Dynamic Interconnect
Architecture”, IEEE Trans. VLSI Syst., vol. 6, pp. 498-501, September 1998.
109
[18] J.M. Rabaey, Digital Integrated Circuits: A Design Perspective. Prentice Hall
Electronics and VLSI series. Prentice Hall, Inc. 1996.
[19] Space Radiation Effects Handbook. Space Electronics Inc. and Full Circle Research.
December, 1997.
[20] I. Sutherland et al, Logical effort: Designing fast CMOS circuits. Morgan-Kaufmann
Publishers. San Francisco, California. 1999.
[21] H. Zhang, M. Wan, G. Varghese, J. Rabaey, “Interconnect Architecture Exploration for
Low-Energy Reconfigurable Single-Chip DSPs”, Proc. IEEE Computer Society Workshop
on VLSI ’99, 8-9 April 1999, pp. 2-8.
[22] Y. Zhang, and M. J. Irwin, “Power and Performance Comparison of Crossbars and
Busses as On-Chip Interconnect Structures”, Proc. 33rd Asilomar Conf. on Signals, Systems
and Computers, Oct. 24-27, 1999, pp. 378-383.