Emerging Technologies in On-Chip and Off-Chip Interconnection Network
A thesis presented to
the faculty of
the Russ College of Engineering and Technology of Ohio University
In partial fulfillment
of the requirements for the degree
Master of Science
Md Ashif Iqbal Sikder
August 2016
© 2016 Md Ashif Iqbal Sikder. All Rights Reserved.
This thesis titled
Emerging Technologies in On-Chip and Off-Chip Interconnection Network
by
MD ASHIF IQBAL SIKDER
has been approved for
the School of Electrical Engineering and Computer Science
and the Russ College of Engineering and Technology by
Avinash Karanth Kodi
Associate Professor of Electrical Engineering and Computer Science
Dennis Irwin
Dean, Russ College of Engineering and Technology
Abstract
SIKDER, MD ASHIF IQBAL, M.S., August 2016, Electrical Engineering
Emerging Technologies in On-Chip and Off-Chip Interconnection Network (80 pp.)
Director of Thesis: Avinash Karanth Kodi
The number of processing cores on a chip is increasing with the scaling down of
transistors to meet the computation demand. This increase requires a scalable,
energy-efficient, and low-latency network to provide reliable communication between the cores.
Traditionally, metallic interconnection networks are used to connect the cores. However,
according to the International Technology Roadmap for Semiconductors (ITRS), metallic
interconnection networks would not be able to meet the future on-chip communication
demands due to energy and latency constraints. Thus, this thesis focuses on novel
on-chip network designs that employ emerging technologies, such as wireless and optics,
to provide a scalable, energy-efficient, and low-latency network. In this thesis, I propose
an on-chip network architecture called Optical and Wireless Network-on-Chip (OWN)
and extend OWN to construct Reconfigurable Optical and Wireless Network-on-Chip (R-
OWN) architecture. OWN and R-OWN both leverage the advantages of optics and wireless
technologies while circumventing the limitations of these technologies. The end result is
that OWN and R-OWN both can provide communication between any two cores in at most
three hops for 256- to 1024-core networks. My simulation results with synthetic
traffic demonstrate that, for 1024-core architectures, OWN requires 34% more area than
hybrid-wireless architectures and 35% less area than hybrid-photonic architectures [1].
In addition, OWN consumes 30% less energy per bit than hybrid-wireless architectures
and 14% more energy per bit than hybrid-photonic architectures [1]. Moreover, OWN
shows 8% and 28% improvement in saturation throughput compared to hybrid-wireless
and metallic architectures respectively [1]. On the other hand, for 256-core architectures,
R-OWN requires 3.9% and 12% less area compared to metallic and hybrid-wireless
architectures respectively. Additionally, R-OWN consumes 44% and 50% less energy per
bit compared to metallic and hybrid-wireless architectures respectively. Furthermore, R-
OWN shows saturation throughput that is 27% and 31% higher than hybrid-wireless and
metallic architectures respectively.
Since the number of memory-intensive applications is increasing, off-chip memory
access is becoming as important as on-chip communication. A metallic link is generally
used to connect the on-chip components to the off-chip memory element. Because
wireless technology offers better energy efficiency and lower latency than metallic
technology over longer distances, in this thesis I propose several hybrid-wireless
networks to explore the use of wireless technology, as an alternative to metallic
technology, for off-chip memory access. My proposed networks require a maximum of two hops
to access the off-chip memory and also significantly reduce both the application execution
time and energy per bit for real traffic. My simulation results show that, for a 16-core net-
work, the on-chip and off-chip wireless network requires 11% less execution time and also
consumes approximately 79% less energy per packet compared to the baseline metallic ar-
chitecture.
Acknowledgements
First, I would like to thank my parents for always supporting me. Second, I would like
to thank my supervisor Dr. Avinash Kodi, for relentlessly pushing me. Third, I would like
to thank my committee members: Dr. Savas Kaya, Dr. Jeffrey Dill, and Dr. David Ingram,
for their valuable time. Lastly, I would like to thank NSF as this thesis work was partially
supported by NSF grants CCF-1054339 (CAREER), CCF-1420718, CCF-1318981, ECCS-
1342657, and CCF-1513606.
Table of Contents
Page
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
List of Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
   1.1 Network-on-Chip (NoC) . . . . . . . . . . . . . . . . . . . . . . . . . 12
   1.2 Issues in NoC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
      1.2.1 Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
      1.2.2 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
      1.2.3 Metallic Interconnects . . . . . . . . . . . . . . . . . . . . . . 16
   1.3 Emerging Technologies in Interconnection Network: Wireless and Photonics 17
      1.3.1 Wireless Interconnection Network . . . . . . . . . . . . . . . . . 17
      1.3.2 Photonic Interconnection Network . . . . . . . . . . . . . . . . . 20
   1.4 Proposed Research and Major Contributions . . . . . . . . . . . . . . . 22
      1.4.1 Heterogeneity in Interconnection Network . . . . . . . . . . . . . 23
      1.4.2 Off-Chip Interconnection Network . . . . . . . . . . . . . . . . . 24
      1.4.3 Key Contributions and Thesis Organization . . . . . . . . . . . . 25
2 Heterogeneous Network-on-Chip . . . . . . . . . . . . . . . . . . . . . . . . 26
   2.1 OWN Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
      2.1.1 64-Core OWN Architecture: Cluster . . . . . . . . . . . . . . . . 28
      2.1.2 1024-Core OWN Architecture: Cluster and Group . . . . . . . . . 29
      2.1.3 Intra-Group and Inter-Group Communication . . . . . . . . . . . 32
      2.1.4 Deadlock Free Routing . . . . . . . . . . . . . . . . . . . . . . 33
   2.2 Technology for OWN: Wireless and Optical . . . . . . . . . . . . . . . . 35
      2.2.1 Wireless Technology . . . . . . . . . . . . . . . . . . . . . . . 35
      2.2.2 Photonics Technology . . . . . . . . . . . . . . . . . . . . . . . 37
   2.3 Reconfigurable-OWN (R-OWN) . . . . . . . . . . . . . . . . . . . . . . 38
      2.3.1 256-Core OWN Architecture . . . . . . . . . . . . . . . . . . . . 38
      2.3.2 256-Core R-OWN Architecture . . . . . . . . . . . . . . . . . . . 40
      2.3.3 Routing Mechanism of 256-Core R-OWN . . . . . . . . . . . . . . 42
      2.3.4 Deadlock Free Routing . . . . . . . . . . . . . . . . . . . . . . 44
3 Off-Chip Interconnection Network . . . . . . . . . . . . . . . . . . . . . . . 46
   3.1 On-Chip and Off-Chip Wireless Architecture . . . . . . . . . . . . . . . 47
      3.1.1 Metallic Interconnects (M-M-X-X) . . . . . . . . . . . . . . . . . 49
      3.1.2 Hybrid Wireless Interconnect (W/M-W/M-X-X) . . . . . . . . . . . 49
         3.1.2.1 On-Chip Hybrid Wireless Interconnect (W-M-X-X) . . . . . . 49
         3.1.2.2 Off-Chip Hybrid Wireless Interconnect (M-W-X-X) . . . . . . 52
         3.1.2.3 On-Chip and Off-Chip Hybrid Wireless Interconnect (W-W-X-X) 52
   3.2 Communication Protocol: Metallic and Hybrid Wireless Interconnect . . . 54
      3.2.1 On-Chip Metallic and Off-Chip Metallic or Wireless Interconnects . 54
      3.2.2 On-Chip Wireless Interconnects With Omnidirectional Antenna and Off-Chip Metallic Interconnects . . . . 56
      3.2.3 On-Chip Wireless Interconnects With Directional Antenna and Off-Chip Metallic Interconnects . . . . 57
4 Evaluation of the Proposed Architectures . . . . . . . . . . . . . . . . . . . 58
   4.1 Performance Evaluation of OWN . . . . . . . . . . . . . . . . . . . . . 59
      4.1.1 Area Estimate . . . . . . . . . . . . . . . . . . . . . . . . . . 59
      4.1.2 Energy Estimate . . . . . . . . . . . . . . . . . . . . . . . . . 60
      4.1.3 Saturation Throughput and Latency Comparison . . . . . . . . . . 62
   4.2 Performance Evaluation of R-OWN . . . . . . . . . . . . . . . . . . . . 64
      4.2.1 Area Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 65
      4.2.2 Energy Estimate . . . . . . . . . . . . . . . . . . . . . . . . . 66
      4.2.3 Saturation Throughput and Latency Comparison . . . . . . . . . . 67
   4.3 Performance Evaluation of On-Chip and Off-Chip Wireless Network . . . . 70
      4.3.1 Execution Time Estimate . . . . . . . . . . . . . . . . . . . . . 70
      4.3.2 Energy per Byte Estimate . . . . . . . . . . . . . . . . . . . . . 71
5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
List of Tables
Table Page
2.1 Optical device parameters [1] © 2015 IEEE. . . . . . . . . . . . . . . . . . 37
3.1 Naming convention of the baseline and proposed on-chip and off-chip wireless architectures [2]. . . . . . . . . . . . . . . 47
3.2 Summary of the baseline and proposed on-chip and off-chip wireless architectures [2]. . . . . . . . . . . . . . . 53
4.1 Simulation parameters for the baseline and proposed on-chip and off-chip wireless architectures [2]. . . . . . . . . . . . . . . 71
List of Figures
Figure Page
1.1 General purpose processor trend-line. . . . . . . . . . . . . . . . . . . . 12
1.2 An example of on-chip mesh network. . . . . . . . . . . . . . . . . . . . . 13
1.3 Layout and physical structure with addressing of a WCube [3] © ACM DOI 10.1145/1614320.1614345. . . . 18
1.4 Architecture of a small-world [4] © 2011 IEEE and an iWISE [5] network © 2011 IEEE. . . . 19
1.5 256-core Firefly architecture [6] © ACM DOI 10.1145/1555754.1555808. . . . 21
1.6 1024-core ATAC architecture [7] © ACM DOI 10.1145/1854273.1854332. . . . 22
2.1 64-core OWN architecture. . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Overview of a 1024-core OWN architecture [1] © 2015 IEEE. . . . . . . . . . 29
2.3 Kilo-core OWN architecture [1] © 2015 IEEE. . . . . . . . . . . . . . . . . 31
2.4 Communication mechanism of a 1024-core OWN architecture [1] © 2015 IEEE. . 33
2.5 Deadlock scenarios in a 1024-core OWN [1] © 2015 IEEE. . . . . . . . . . . 35
2.6 256-core OWN architecture. . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.7 Structure of 256-core R-OWN and a wireless router [8]. . . . . . . . . . . . 40
2.8 Communication mechanism of a 256-core R-OWN [8]. . . . . . . . . . . . . . 43
2.9 Deadlock scenarios in a 256-core R-OWN with a deadlock avoidance technique [8]. 44
3.1 General structure of the baseline and proposed off-chip wireless architectures [2]. 48
3.2 General structure of the proposed on-chip and off-chip wireless architectures [2]. 50
3.3 Communication mechanism of the proposed hybrid-wireless architectures [2]. . 55
4.1 Evaluation of OWN’s area requirement [1] © 2015 IEEE. . . . . . . . . . . . 59
4.2 Evaluation of OWN’s energy requirement [1] © 2015 IEEE. . . . . . . . . . . 61
4.3 Evaluation of OWN’s latency requirement [1] © 2015 IEEE. . . . . . . . . . 63
4.4 Evaluation of OWN’s saturation throughput [1] © 2015 IEEE. . . . . . . . . 64
4.5 Evaluation of R-OWN’s area requirement. . . . . . . . . . . . . . . . . . . 65
4.6 Evaluation of R-OWN’s energy requirement. . . . . . . . . . . . . . . . . . 67
4.7 Evaluation of R-OWN’s latency requirement. . . . . . . . . . . . . . . . . . 68
4.8 Evaluation of R-OWN’s saturation throughput. . . . . . . . . . . . . . . . . 69
4.9 Execution time estimate of the hybrid-wireless architectures [2]. . . . . . . 72
4.10 Energy per byte comparison of the baseline and the proposed hybrid-wireless architectures. . . . 73
List of Acronyms
Chip Multiprocessor CMP
Network-on-Chip NoC
On-Chip Network OCN
Instruction Level Parallelism ILP
Instructions Per Cycle IPC
Dynamic Random Access Memory DRAM
Complementary Metal Oxide Semiconductor CMOS
Metal Oxide Field Effect Transistor MOSFET
Fin Field Effect Transistor FinFET
Time Division Multiplexing TDM
Frequency Division Multiplexing FDM
Code Division Multiplexing CDM
Space Division Multiplexing SDM
Wavelength Division Multiplexing WDM
Dense Wavelength Division Multiplexing DWDM
International Technology Roadmap for Semiconductors ITRS
Wireless Network-on-Chip WiNoC
Dimension Order Routing DOR
Single-chip Cloud Computer SCC
Multi-Purpose Processor Array MPPA
Dynamic Voltage and Frequency Scaling DVFS
Single Write Multiple Read SWMR
Multiple Write Single Read MWSR
Virtual Channel VC
Giga bit per second Gbps
Radio Frequency RF
Miss Status Hold Register MSHR
Micro Ring Resonator MRR
Micro Wireless Router MWR
Carbon Nanotube CNT
Network Interface Controller NIC
Double Data Rate DDR
Low Voltage Technology LVT
Uniform Normal UN
Bit Reversal BR
Perfect Shuffle PS
Neighbor NBR
Complementary COMP
Matrix Transpose MT
Butterfly BFLY
Princeton Application Repository for Shared-Memory Computers PARSEC
1 Introduction
In the last decade of the twentieth century, the performance of microprocessors,
following Moore’s law, continued to increase through instruction level parallelism (ILP),
faster clock frequencies, and increasing transistor counts [9]. However, near
the beginning of the twenty-first century, as processors already issued multiple instructions
per cycle (IPC), only marginal performance gains were achieved from ILP. Moreover, since
the dynamic power is directly proportional to frequency, the microprocessor clock
frequency could not be increased indefinitely. Thus, with the scaling down of transistors,
computer architects continued to add more transistors to achieve higher performance
gains. Although the per-transistor power requirement was reduced with each process
generation, accommodating a vast number of transistors on a single chip increased
the total power consumption to a level where chip power and thermal management became
complex and costly [10]. Therefore, the industry shifted from uniprocessor to
multiprocessor design, namely Chip Multiprocessor (CMP). As the name suggests, a CMP
is a collection of simple uniprocessors (processing core or simply core) integrated into
a single chip so that they can share the workload. As a result, a single, large complex
processor is replaced by several small simple processors to boost the performance [10].
The cores of a CMP may frequently need to communicate with each other to execute an
application or multiple applications. The simplest communication network in CMP is the
shared single bus that consists of a set of parallel wires to which various components are
connected. As the connected components share the bus, only one of them can transmit at a
time, which limits the performance and increases communication delay. In addition, Figure
1.1 shows that as the number of cores increases exponentially to satisfy application
requirements, a bus-based communication system is clearly not scalable to accommodate
0 Some material of this thesis was used verbatim from my publication [1] with permission © 2015 IEEE, and from two publications, [8] and [2], accepted but not published at the time of this thesis submission.
Figure 1.1: General purpose processor clock frequency and number of on-chip processing cores over time and their estimated trend-line [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27].
the large number of cores on the chip. Thus, future multi-core processors require an on-chip
communication fabric that is scalable, modular, and provides the desired performance even
with hundreds to thousands of components (processing cores and caches) in an energy-
efficient way. This communication fabric is called Network-on-Chip (NoC) [11].
1.1 Network-on-Chip (NoC)
Network-on-Chip (NoC), also known as On-Chip Network (OCN), is an interconnec-
tion network that establishes connections between many components such as memory ele-
ments, registers, and processing cores residing on a single chip [11]. One such network is
shown in Figure 1.2 which consists of routers, cores, caches (L1, L2), memory controllers,
and interconnection links. Each router is connected with one or more processing cores and
Figure 1.2: An example of a 16-core 4×4 mesh Network-on-Chip (NoC). It contains routers as network interfaces, processing cores, on-chip memory elements (Level 1 and Level 2 cache), and a memory controller (MC) to access off-chip memory, DRAM.
usually multiple on-chip memories. Routers are also connected to each other through
interconnection links. Routers work as the network’s entrance and exit points for the cores and
memory elements. Any core or memory element that needs to send a packet will utilize
the adjacent router connected to that element to send the packet to the destination router.
Some of the routers are connected to memory controllers for off-chip memory, Dynamic
Random Access Memory (DRAM), access. Memory controllers are connected to the off-
chip memory modules via metallic links, and they connect the on-chip memory elements
to the off-chip DRAM.
Network-on-Chip (NoC) is the backbone of a many-core computing system, ensuring
proper transmission of messages between the on-chip components. A message
can be sent as a whole packet or broken into several smaller packets before transmission.
Sending a message as several smaller packets via packet switching is faster and more
efficient than driving a large number of wires and is, therefore, used more widely than
circuit switching [11]. In a packet switching network, packets can be routed such that
the path requires the least number of hops (minimal path), is least congested (non-minimal
path), or a combination of both [11]. One popular routing method for the mesh network,
commonly used for NoCs and shown in Figure 1.2, is dimension order routing (DOR) [11].
Packets following the DOR protocol travel in the X direction first and then the Y direction,
or vice versa, to ensure deadlock-free routing. Since the mesh topology is easy to fabricate
and the DOR mechanism is easy to implement, DOR-based mesh networks are very
common. Some other common topologies include the torus, flattened butterfly, and concentrated
mesh [28]. Some of these topologies are shown below with several commercial processor
examples.
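The XY variant of dimension-order routing described above can be sketched in a few lines of Python. This is an illustrative sketch, not code from this thesis; the function names and (x, y) router coordinates are my own.

```python
def xy_next_hop(cur, dst):
    """Next router under XY dimension-order routing on a 2D mesh.

    The packet fully resolves the X dimension before turning into Y.
    Forbidding the opposite turn order breaks the cyclic channel
    dependencies that could otherwise deadlock the network.
    """
    cx, cy = cur
    dx, dy = dst
    if cx != dx:                                  # route in X first
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:                                  # then route in Y
        return (cx, cy + (1 if dy > cy else -1))
    return cur                                    # arrived: eject locally

def xy_path(src, dst):
    """Hop-by-hop path from src to dst, inclusive of both endpoints."""
    path = [src]
    while path[-1] != dst:
        path.append(xy_next_hop(path[-1], dst))
    return path

# On a 4x4 mesh such as Figure 1.2, a packet from router (0, 0) to (2, 1)
# takes 3 hops: (0,0) -> (1,0) -> (2,0) -> (2,1).
```

Because each packet makes at most one X-to-Y turn, the routing logic per router reduces to two comparisons, which is one reason DOR-based mesh routers are cheap to build.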
There are some commercial prototypes available that have implemented NoC as the
communication paradigm for many-core processors. For example, Intel Single-chip Cloud
Computer (SCC) has integrated 48 cores into a silicon chip and is intended to increase the
core counts to 100 and beyond. Intel SCC is divided into tiles where each tile contains
two cores, and the tiles are connected as a 2D mesh network [29]. Continuing the
tile-based approach, Intel presented Teraflops, the first programmable chip to compute
one trillion mathematical calculations per second while consuming only 62 W. Intel
Teraflops contains 80 simple cores that are connected as a 2D mesh network [30]. Another
processor manufacturing company, EZchip, announced the first 100-core 64-bit processor
called Tile-Mx100. This processor uses ARMv8 cores and a 2D mesh network to connect the
cores [31]. However, there are some drawbacks of a 2D mesh network such as congestion
at the center routers due to XY routing and large delay when the number of cores increases
due to additional hops. As a result, Kalray designed an MPPA (Multi-Purpose Processor
Array) with a 2D wrapped-around torus NoC architecture, and the MPPA roadmap features
64 to 1024 cores on a single chip [32].
1.2 Issues in NoC
Traditional NoC designs are predominantly metallic 2D mesh or torus. With the
increasing number of cores, multi-hop communication and routing complexity increase,
which significantly impacts the overall performance of the NoC due to high latency and
energy consumption [5]. In the following sub-sections, the primary issues of NoC, such as
energy, latency, and the limitations of metallic interconnects, are discussed.
1.2.1 Energy
The increase in the number of processing cores on a single chip has boosted the
network traffic which, in turn, has increased the energy consumption. Since higher
clock frequency increases energy dissipation, network clock frequency can be reduced
to lower the energy consumption. However, this would slow down the communication
process and hurt performance. Instead of reducing the clock frequency, power-gating
can be used to reduce energy consumption by turning off the on-chip components not
being used. Nevertheless, power-gating would incur additional delay due to the wake-up
latency of the turned-off on-chip components [33], [34]. Another technique for reducing
power consumption is Dynamic Voltage and Frequency Scaling (DVFS). DVFS adjusts
the interconnection bandwidth by varying the voltage and frequency levels, and thus can
reduce the interconnection network’s energy dissipation [35]. Nonetheless, DVFS increases
the network cost and complexity due to the predictor and control circuits, and incurs
additional latency due to misprediction and switching. On the other hand, a routing
algorithm can play a role to constrain the energy consumption of a network. For example,
taking the minimal path would require less energy than the non-minimal path choices but
might congest some links. Therefore, proper selection of a routing algorithm is necessary
to mitigate the energy dissipation problem.
1.2.2 Latency
The increasing number of processing cores and memory-intensive applications is
driving the network capacity to its limits and increasing network congestion. Since
congestion can potentially stall the whole network, it is important to reduce network
congestion. One way to reduce network congestion is to increase network resources such as
channel width and number of buffers. Increasing these network resources would decrease
network congestion but increase the cost of the system. Hence, sharing of channels, buffers,
and links can be introduced to overcome the limited network resources and support the
network traffic demand. However, such sharing increases latency due to the delay in
shared network resource allocation. Another technique to speed up packet transmission,
and thus reduce latency, is flow control. Flow control techniques such as buffer allocation
and switch arbitration can be modified to improve latency, but this can increase network
and routing complexity. On the other hand, since network diameter is determined by the
routing algorithm used, the routing algorithm can play a vital role to reduce the network
latency. Nevertheless, both the minimal and non-minimal path routing can increase network
latency, depending on the network load pattern. Therefore, intelligent allocation of network
resources is necessary to keep the network latency at a minimum.
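The hop-count component of latency discussed in this section can be made concrete with a standard first-order zero-load model. This is a textbook sketch under the assumption of wormhole switching, not a result from this thesis; the parameter values are illustrative only.

```python
def zero_load_latency(hops, t_router, t_link, packet_bits, bandwidth):
    """First-order zero-load packet latency under wormhole switching.

    The header flit pays per-hop router and link delay; the packet
    body streams behind it, adding a single serialization term
    (packet length / channel bandwidth) rather than paying the full
    packet delay at every hop.
    """
    header_latency = hops * (t_router + t_link)
    serialization = packet_bits / bandwidth
    return header_latency + serialization

# Example: 3 hops, 1 ns per router, 1 ns per link, a 512-bit packet
# over a 128 Gbit/s channel -> 3 * 2 ns + 4 ns = 10 ns.
print(zero_load_latency(3, 1.0, 1.0, 512, 128))  # units: ns, Gbit/s
```

The hop-count term is what low-diameter designs target, while the serialization term motivates wider or faster channels; congestion adds queueing delay on top of this zero-load floor.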
1.2.3 Metallic Interconnects
Traditionally, metallic interconnection technology was used to connect the on-chip
components such as processing cores and memory controllers. Metallic interconnection
technology has the advantages of lower energy requirement, high bandwidth, and lower
area requirements. However, with the scaling down of the technology, wire resistance
and inter-wire capacitance are increasing, which increases the energy consumption and
link latency. Additionally, increasing the number of cores requires complex multi-hop
routing that increases network latency. In order to facilitate lower latency communication,
one or more longer bus-like links can be introduced, but this would increase the
energy consumption and the number of repeaters. Moreover, according to
the International Technology Roadmap for Semiconductors (ITRS), the development of metallic
interconnection technology would not be sufficient to satisfy the requirements of future
Chip Multiprocessors (CMPs). Therefore, as a potential solution to the problems faced
by metallic interconnection technology, researchers have started to experiment with emerging
technologies such as wireless and photonics for interconnection networks.
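The quadratic growth of wire delay with length, and the repeater trade-off noted above, can be illustrated with the standard first-order distributed-RC (Elmore) estimate. This is a sketch with illustrative parameters; it ignores the repeaters' own intrinsic delay, energy, and area, which are exactly the overheads the text identifies.

```python
def wire_delay(r_per_mm, c_per_mm, length_mm, segments=1):
    """First-order delay of a metallic wire split into repeater-driven
    segments.

    An unrepeated distributed-RC wire has delay ~0.38 * R * C, which
    grows quadratically with length.  Inserting repeaters (segments > 1)
    makes the delay linear in length, at the cost of the repeaters'
    extra energy and area.
    """
    seg_len = length_mm / segments
    # 0.38 * R * C per distributed-RC segment, summed over all segments
    seg_delay = 0.38 * (r_per_mm * seg_len) * (c_per_mm * seg_len)
    return segments * seg_delay

# Doubling an unrepeated wire's length quadruples its delay, while the
# same wire cut into repeater-driven segments scales roughly linearly.
```

This quadratic-versus-linear trade-off is why long global metallic links demand ever more repeaters as technology scales, and why one-hop wireless or photonic links become attractive at chip-scale distances.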
1.3 Emerging Technologies in Interconnection Network: Wireless and Photonics
Emerging technologies such as wireless and photonics show promising results
and have the potential to be an alternative to traditional metallic interconnects. In light
of recent scholarly work on wireless and photonic interconnection networks, I will discuss
the advantages and disadvantages of these two technologies along with the architectures in
the following subsections.
1.3.1 Wireless Interconnection Network
Wireless technology offers several advantages such as one-hop communication, mul-
ticasting and broadcasting, reconfiguration of the network, absence of hardwired physical
channels, and Complementary Metal Oxide Semiconductor (CMOS) compatibility.
However, wireless technology is not energy efficient for short-distance communication [36], [37]
and has a limited bandwidth at a 60 GHz center frequency. Additionally, the area footprint
of the wireless transceiver is higher compared to other interconnection technologies.
There are two types of Wireless Network-on-Chip (WiNoC): wireless-only and hybrid-
wireless. A wireless-only system utilizes wireless technology alone to connect the on-chip
components. Because of the limited bandwidth and high transceiver area, wireless-only
networks are less common. In contrast, a hybrid-wireless system combines short-range
metallic and wireless interconnects to communicate between the on-chip components. This
Figure 1.3: Layout of a WCube0 on the left and physical structure with addressing of WCube on the right [3] © ACM DOI 10.1145/1614320.1614345.
system optimizes the usage of local metallic and wireless technology to reduce latency
and energy consumption of the network, and thus, is more common. The bandwidth
limitation of on-chip wireless technology can be circumvented by employing time division
multiplexing (TDM), frequency division multiplexing (FDM), space division multiplexing
(SDM), and code division multiplexing (CDM) techniques. Therefore, most of the hybrid-
wireless networks use a combination of these techniques. One such network is WCube
[3], shown in Figure 1.3. WCube is built on top of CMesh [28] by inserting a micro
wireless router (MWR) for every 64-core cluster. MWR is used to transmit a packet if
the number of wired hops required is higher than the number of wireless hops required.
This network scales logarithmically with the number of cores and provides a lower-latency
and energy-efficient network by optimizing metallic and wireless technologies. However,
WCube is a multi-hop wireless network that does not utilize the advantage of one-hop
wireless transmission, and the number of wireless hops required increases proportionally with the
level of WCube. Moreover, the source WCube overhears its own message due to the broadcast
nature of transmission, which increases energy consumption. In addition, the number of receivers
required increases multiplicatively with the level of WCube, and the frequency spectrum is
Figure 1.4: (a) Subnet architecture and network topology of hubs connected by a small-world graph [4] © 2011 IEEE. (b) iWISE-256 architecture showing the wireless communication between sets [5] © 2011 IEEE.
not reusable. WiNoC [4] proposes a two-tier hybrid wireless architecture where cores are
divided into subnets and subnets are connected using wired and wireless links, shown in
Figure 1.4 (a). All the cores of a subnet are connected to a hub, and hubs use wired links
to communicate with the neighboring hubs and wireless links for distant hubs. However,
the primary disadvantage of WiNoC is that the CNT antenna used is difficult to fabricate
and a long wire is required to connect the hub with the cores. Moreover, the architecture is
not scalable because increasing the subnet size decreases throughput and increases energy
dissipation per packet as well as area of the network.
iWISE [5] distributes the wireless hubs throughout the network as shown in Figure
1.4 (b). It provides one-hop communication between any two cores using either a
wired or wireless link for a 64-core network. It uses a combination of TDM and
FDM to scale to a higher number of cores. It reduces energy consumption and area
requirement with improved performance when compared to other state-of-the-art metallic
and wireless architectures. Nevertheless, the main disadvantage of iWISE is that scalability
becomes expensive and complex. HCWiNoC, another hybrid wireless architecture with
distributed hubs, can scale up to kilo-core and double the throughput with a reduced energy
requirement when compared to other state-of-the-art WiNoC architectures [38]. However,
the area cost of this network is high.
1.3.2 Photonic Interconnection Network
Photonic technology offers the advantages of high bandwidth, low power
requirement, low latency, convenient reconfiguration of the network, multicasting and
broadcasting, and CMOS compatibility. However, photonic technology requires physical
waveguides that define the network connections, and optical-only crossbars are not
scalable to kilo-core networks [1]. In addition, this technology involves inefficient off-chip
laser source coupling, static laser power loss, electrical to optical and optical to electrical
conversion loss, and high broadcasting power [6].
Similar to wireless networks, photonic networks can be of two types: (1) a photonic-only
network uses only photonics to facilitate on-chip communication, whereas (2) a
hybrid-photonic network uses wired links in addition to photonic links for the transmission of packets.
Early photonic networks generally use a global photonic crossbar with wavelength division
multiplexing (WDM). One such network is Corona, presented in [39]. Corona proposes
a photonic crossbar for a 256-core network with a core concentration of 4, which provides
one-hop communication between any two cores. Each waveguide contains 64 wavelengths
with an off-chip laser source. Each router is connected to a memory controller through
a photonic link and to an arbitration waveguide to maintain signal integrity. Corona uses
a single-write-multiple-read (SWMR) arbitration technique in which a router sends messages
to its assigned wavelengths. This message can be read by all other routers of the network.
However, Corona requires laser power proportional to the number of detectors and is not
scalable due to high power and area requirements. Firefly proposes a hybrid-photonic
network that contains multiple global crossbars [6], as shown in Figure 1.5. Unlike Corona,
Figure 1.5: Shared waveguide inter-cluster communication is shown on the left and waveguide for a 256-core architecture is shown on the right [6] © ACM DOI
in order to reduce broadcasting power, Firefly uses reservation-assisted SWMR (R-SWMR),
where electrical links are used to turn on only the destination detector. It also divides
the network into several smaller clusters. Intra-cluster communication employs electrical
links, whereas inter-cluster communication uses multiple photonic crossbars with dense
wavelength division multiplexing (DWDM). The use of multiple smaller crossbars reduces
hardware complexity and eliminates the need for global arbitration. However, R-SWMR
introduces area and energy overhead, and multiple global link traversals increase
conversion loss and transmission power.
A photonic Clos-based network is proposed in [40] that shows improved performance
compared to a global photonic crossbar. It consumes less energy and area due to its
small-diameter crossbar network and provides uniform throughput and latency. It is
an optimization between the low-radix, high-diameter mesh and the high-radix, low-diameter
crossbar topologies. It requires shorter waveguides with fewer rings and provides
Figure 1.6: 1024-core ATAC architecture [7] © ACM DOI 10.1145/1854273.1854332.
multiple paths between source and destination. However, multi-hop photonic routing and
randomized oblivious routing increase the latency of such a network. ATAC, shown in
Figure 1.6, is the first hybrid optical crossbar network that scales to kilo-core counts [7]. It
divides the network into several smaller clusters. Cores inside a cluster are connected in an
electrical mesh network, and each cluster contains a hub for global communication. Hubs
are connected by a photonic ring crossbar that exploits the broadcasting capability of photonic
technology. However, photonic broadcast requires high laser power because each detector
peels off a portion of the signal, and broadcasting at the hubs over long electrical links also
increases power. Moreover, multi-hop communication and shared hubs increase network latency.
1.4 Proposed Research and Major Contributions
In this thesis, I research both on-chip and off-chip interconnection networks using
emerging technologies such as wireless and photonics. For on-chip networks, my focus
is to use multiple emerging technologies to provide lower latency and energy-efficient
communication fabric. In the case of off-chip networks, my goal is to explore the feasibility
of using emerging technologies as an alternative to the current metallic technology. In the
following subsections, I will discuss these research objectives in detail.
1.4.1 Heterogeneity in Interconnection Network
Emerging technologies are expected to be the future alternative to traditional
metallic interconnection technology, but as discussed, emerging technologies have
drawbacks of their own, much like metallic interconnection technology. As a result, hybrid networks
were introduced in which traditional metallic technology and emerging technologies coexist
on the same architecture. However, the demands of faster computing machines will exceed
the capacity of hybrid architectures in the near future. Thus, one emerging technology is
not sufficient, and it is necessary to exploit the benefits of multiple emerging technologies
to provide the desired performance. This integration of multiple emerging technologies
into an interconnection network is called heterogeneity in interconnection networks.
In this thesis, I propose to integrate two emerging technologies, photonics and wireless,
on the same chip. Wireless and photonic technologies have the potential to complement
each other in order to boost energy savings and performance gains that cannot be achieved
with a single technology. First, wireless technology is constrained in bandwidth, whereas
photonics has ample bandwidth. Second, where a photonic link requires the presence of
a physical waveguide, wireless does not require any hard-wired channel. Third, while
photonic power consumption increases with waveguide length, wireless
technology is more efficient for distant communication. Fourth, the wireless transceiver
footprint is larger than that of other technologies, and a smaller photonic crossbar is
area efficient. Therefore, the combination of photonic and wireless technologies in an
interconnection network could be promising. My simulation results show that the
proposed heterogeneous architecture consumes 30% less energy/bit than wireless and 14%
more energy/bit than photonic architecture while providing higher saturation throughput
when compared to wired, wireless, and photonic networks. In addition, the proposed
heterogeneous architecture occupies 34% more and 35% less area than the hybrid-wireless
and photonic-only architectures, respectively.
1.4.2 Off-Chip Interconnection Network
Even though the importance of the on-chip communication paradigm cannot be denied,
off-chip memory access latency can no longer be ignored due to the increase
in off-chip memory accesses. As a result, industry is currently focusing not only on
reducing on-chip latency and energy costs but also on ways to reduce off-chip memory
access latency and energy costs. Therefore, emerging technologies such as wireless are being
considered to reduce off-chip memory access latency and energy.
The energy cost of the metallic technology increases proportionally with distance.
Since the distance between a memory controller and a DRAM is large (around 50 mm
[41]) compared to on-chip distances (around 5 mm), wireless technology can be a better
alternative. Moreover, wireless technology can provide flexible interconnection between
several distant memory modules which becomes complex if metallic technology is used.
For example, memory controllers may need to communicate with each other. This can
be achieved in wireless technology by allocating a unique or shared frequency channel.
In contrast, long wires are required if metallic technology is used. In addition, off-chip
link traversal time for wireless technology is lower compared to the metallic technology.
This is because metallic technology requires repeaters, which introduce RC delay.
My simulation results show that the proposed hybrid-wireless architectures consume on
average 79% less energy per byte with 11% lower execution time when compared to the
baseline wired architectures.
1.4.3 Key Contributions and Thesis Organization
In the preceding subsections, I described the research focus of this thesis
and presented my main ideas. The major contributions of this thesis are the following:
• Exploration of heterogeneity in interconnection network: The idea of combining
wireless and photonic technologies on the same chip has some technological
limitations. In this thesis, I not only analyze the use of these two technologies on the
same chip in terms of performance but also elaborate on the technological feasibility
of combining them.
• Introduction of reconfigurable links in a heterogeneous network: In addition to the
introduction of the heterogeneous network, I optimized the wireless link usage by
reconfiguring the wireless links at run-time. My simulation results indicate that
the reconfigurable heterogeneous architecture improves the performance (throughput
and latency) by 15% when compared to the baseline heterogeneous architecture.
However, the energy consumption of the reconfigurable heterogeneous architecture
is 7% higher than the baseline heterogeneous architecture for a 256-core network.
• Emerging technologies for off-chip network: Emerging technologies might be the
future alternative to metallic links for off-chip memory access. I explore, for the first
time, the use of wireless technology to access off-chip memory (DRAM).
The rest of the thesis is organized as follows: chapter two describes the proposed
heterogeneous and reconfigurable heterogeneous architectures with technological aspects,
chapter three delineates the use of wireless technology for off-chip memory access, chapter
four presents the simulation results of the networks proposed in chapters two and three, and
chapter five concludes the thesis.
2 Heterogeneous Network-on-Chip
In this chapter, I discuss the two proposed architectures: Optical and Wireless Network-
on-Chip (OWN) architecture for 1024-core CMPs and Reconfigurable Optical and Wireless
Network-on-Chip (R-OWN) architecture for 256-core CMPs. Both architectures combine
the optical and wireless technologies to provide a scalable, low latency, and energy efficient
network-on-chip. I propose to share an optical crossbar among 64 cores (called a cluster) using
the wavelength division multiplexing (WDM) technique, because this decomposition of optical
crossbars (1) maximizes the efficiency of the lasers, since the lasers are always on, (2)
reduces latency by reducing the wait time for tokens, and (3) reduces insertion losses due to
shorter waveguides. I also propose to use wireless technology to interconnect the clusters
in order to provide one-hop cluster-to-cluster communications [1].
Instead of using wireless interconnects, a second level of metallic or optical
interconnects could be used to connect the clusters, but there would be several
complications. Two complications that can occur with metallic interconnects are: (1) a
metallic interconnect would not scale to a higher number of cores (say, 1024 cores) and
(2) reconfiguring a metallic interconnect (for example, using power gating) would increase
network complexity. Three complications may occur for optical interconnects: (1) multiple
optical layers would cause heat dissipation problems that could degrade network
performance because optics is sensitive to heat, (2) optical networks would require constant
laser power, and turning off certain wavelengths would require off-chip transmission, which
would incur additional delay, and (3) a higher number of modulators and demodulators
would be required for reconfiguration, which would increase power loss. In contrast,
wireless interconnects are ideal for reallocating bandwidth due to the lack of wires and
wide frequency spectrum, and since the antennas are on the chip, they can be turned off if
necessary. As a result, I can build an architecture of up to 1024 cores that requires a maximum
of three hops for any-to-any core communication.
Figure 2.1: 64-core OWN architecture consisting of a 16 × 16 optical crossbar, data waveguide(s), and an arbitration waveguide. The structure of a tile and the proposed optical router is shown on the right.
This chapter is organized in three sections. First, I describe in detail the architecture
of OWN with the routing mechanism and deadlock avoidance technique. Second, I
evaluate the technological feasibility of implementing OWN. Third, I build R-OWN on
top of OWN by making the wireless links reconfigurable at runtime to incorporate diverse
communication patterns, and describe a deadlock-free routing mechanism that differs
from that of OWN.
2.1 OWN Architecture
In this section, first, I describe the design of a 64-core OWN architecture using optical
technology. Second, I use the 64-core OWN as the basic building block to design a 1024-
core OWN employing wireless technology. Third, I explain the routing mechanism with
examples. Fourth, since the switching of technology (optical to wireless and wireless to
optical) may create deadlocks, I propose a technique to ensure deadlock freedom [1] ©
2011 IEEE.
2.1.1 64-Core OWN Architecture: Cluster
The OWN architecture is a tile-based architecture with each tile consisting of four
processing cores and their private L1 instruction and data caches, a shared L2 cache, and
a network interface controller (NIC) or router. The inner components of a tile are shown
in Figure 2.1 for the four cores connected to router 15 (upper right-most tile). Each tile is
located within a cluster, which consists of 16 such tiles (64 cores). The tiles inside a cluster
are represented by two coordinates (r, c) where r is the number of the tile or the router and
c identifies one of the four cores in that tile. These tiles are connected by a 16 × 16 optical
crossbar, the snake-like optical waveguide shown in Figure 2.1, and core-to-core
communication takes one hop. I propose a multiple-write-single-read (MWSR)
scheme with arbitration wherein each tile is assigned dedicated wavelength(s) to receive
messages from the remaining 15 tiles. In contrast, a single-write-multiple-read (SWMR)
scheme requires high laser power because one router writes to its assigned channel and
all the remaining routers can read by peeling off a portion of the wavelengths [6]. I
chose MWSR over SWMR to reduce the laser power consumption; however, the power
consumption can be reduced even in SWMR by tuning only the intended receiver [6].
The tradeoff in using MWSR is increased latency since each router must wait to grab the
token before writing to a specific channel. As there are 16 routers inside the cluster and
communication between the routers requires only one hop, I argue that this latency will not
dramatically affect performance. Hence, for any tile of the 64-core OWN
architecture, any of the remaining 15 tiles can write to it, and all 16 tiles can read at the
same time on their assigned wavelength(s). Thus, each cluster requires two waveguides. For example,
core (1, 3) wants to send a packet to core (5, 2). Router 1 will wait for the token to modulate
Figure 2.2: The basic building block is a tile; sixteen tiles form a cluster, four clusters form a group, and four groups form the 1024-core OWN architecture [1] © 2015 IEEE.
the wavelength(s) assigned to router 5 (shown in blue in Figure 2.1). Upon receiving the
token, router 1 will modulate the appropriate wavelength(s) to router 5. In addition, an
arbitration waveguide is used to arbitrate between multiple routers that want to transmit to
the same receiver, so that signal integrity is maintained [1]© 2011 IEEE.
2.1.2 1024-Core OWN Architecture: Cluster and Group
The building blocks of the 1024-core OWN architecture are shown in Figure 2.2. As
explained before, sixteen tiles form a cluster, four clusters form a group, and four groups
form the 1024-core OWN architecture. Intra-cluster communication is implemented using
optical interconnects. Inter-cluster communication, which includes intra-group and inter-
group communication, is facilitated using wireless interconnects. Starting at the top level,
since there are four groups, twelve (4P2) unidirectional frequency channels are required
for inter-group communication. Unique pairs of frequency channels are assigned for
communication between each pair of groups. As a result, each group needs three frequency
channels to send packets to the rest of the groups (horizontal, vertical, and diagonal
groups). Each cluster inside a group is assigned three transmitter antennas matched to
those frequencies, employing TDM. This ensures that, of the four clusters inside the group,
only one at a time can send data using the shared channel to a destination group. Similarly,
each cluster has three receiver antennas tuned to the frequencies of the other groups. Since I
use multicast to overcome wireless bandwidth limitation, receivers of all four clusters can
receive messages or packets at the same time. However, each cluster decides whether to
keep or discard the packet(s). Inside a group, the four clusters are connected using a 32
Gbps frequency channel. This frequency channel is shared by the four clusters of a group,
where only one of them can write but all of them can receive simultaneously. Therefore,
each cluster of a group will have four transceivers: one for intra-group communication and
three for inter-group communication [1]© 2011 IEEE.
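The channel assignment described above can be sketched in a few lines of code. This is a hedged illustration, not code from the thesis: the function names and data structures are my assumptions; only the channel counts and the Fxy naming follow the text.

```python
# Illustrative sketch of the OWN wireless channel assignment (assumed
# structure; channel names Fxy follow the thesis notation).

def inter_group_channels(num_groups=4):
    """One unidirectional channel Fxy per ordered pair of groups (4P2 = 12)."""
    return ["F%d%d" % (x, y)
            for x in range(num_groups)
            for y in range(num_groups) if x != y]

def intra_group_channels(num_groups=4):
    """One channel per group (F00, F11, F22, F33) shared by its clusters."""
    return ["F%d%d" % (g, g) for g in range(num_groups)]

inter = inter_group_channels()
intra = intra_group_channels()
# With SDM (discussed below), the four intra-group channels collapse onto a
# single frequency, so the total drops from 12 + 4 = 16 to 12 + 1 = 13.
total_with_sdm = len(inter) + 1
```

Enumerating the channels this way makes the 4P2 = 12 inter-group count and the 16-to-13 reduction from frequency reuse easy to verify.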
The four corner routers of each cluster (Figure 2.1) are chosen for the on-chip wireless
communication. The complete architecture for a 1024-core OWN is shown in Figure 2.3.
The red transceivers connected to routers A, B, C, and D indicate the intra-group
wireless communications between the clusters of groups 0, 1, 2, and 3, respectively. Only
the routers for intra-group communication contain both a transmitter and a receiver
tuned to the same frequency. For example, the intra-group wireless routers A,
B, C, and D have transceivers tuned to the frequency channels F00, F11, F22, and F33
respectively. Routers for the inter-group communication contain a transmitter tuned to the
frequency assigned to that group for communicating with the other groups and a receiver
tuned to the frequency of the sender group. For example, each of the four inter-group
wireless routers E of group 0 in Figure 2.3 contains a transmitter tuned to frequency F01
and a receiver tuned to frequency F10. Similarly, for communicating with the diagonal
groups, each router P of group 2 contains a transmitter tuned to frequency F21 and a
receiver tuned to the transmitting frequency of group 1, F12. From Figure 2.3, it can
be seen that only the frequency channels assigned for the intra-group communications can
be reused employing SDM. This replaces the need for the four intra-group frequency channels
F00, F11, F22, and F33 with only one wireless channel, F0. Hence, in total, thirteen 32
Figure 2.3: Kilo-core OWN architecture. Routers with the same letter share a frequency channel, and Fxy represents a wireless channel to send packets from group x to group y. For example, routers A, B, C, and D use the intra-group wireless channels F00, F11, F22, and F33 respectively. Routers E, F, G, and H require four inter-group wireless channels F01, F10, F23, and F32 respectively to communicate with the horizontal group. Routers I, J, K, and L require four inter-group wireless channels F02, F20, F13, and F31 respectively to communicate with the vertical group. Routers M, N, O, and P require four inter-group wireless channels F03, F30, F12, and F21 respectively to communicate with the diagonal group [1] © 2015 IEEE.
Gbps frequency channels are required for the proposed OWN architecture. More on this
wireless technology is explained in the technology section [1]© 2011 IEEE.
2.1.3 Intra-Group and Inter-Group Communication
Consider Figure 2.4 for the detailed communication pattern. Each core in the 1024-core
OWN is identified by a four-part coordinate: group, cluster, router, and core number. It
is represented as (g, cs, r, c) where g is the group, cs is the cluster, r is the router, and c is the core number.
Thus, the total number of cores in OWN is 4 groups × 4 clusters × 16 tiles × 4 cores = 1024,
where 0 ≤ g ≤ 3, 0 ≤ cs ≤ 3, 0 ≤ r ≤ 15, and 0 ≤ c ≤ 3. For example, core (2, 2, 0, 1) is in group 2, cluster 2 (top-left
position inside a group), and at the first tile (router 0). If this core wants to send a packet
to core (2, 1, 13, 3), then it is an intra-group communication. The packet from the source
router will be sent to the right-most corner router (2, 2, 3) over the optical link when it has
the token to write. Once the packet arrives at router (2, 2, 3), the router will wait for
the intra-group frequency channel, F0. Once router (2, 2, 3) has the right to transmit, it will
broadcast the packet to the other three routers that are assigned the intra-group wireless
frequency. Only the router (2, 1, 12) at the destination cluster will accept the packet, and
the remaining two routers will discard the packet. When router (2, 1, 12) has the token to
write to the wavelengths assigned to router (2, 1, 13), router (2, 1, 12) will send the packet
to the destination router (2, 1, 13) over the optical link. This will require three hops in the
following sequence: one optical, one wireless, and one optical [1]© 2011 IEEE.
Next, consider inter-group wireless communication between horizontal groups with
source core (2, 3, 14, 3) and destination core (3, 2, 11, 1). The source core (2, 3, 14, 3) will
insert the packet into router (2, 3, 14). After receiving the token, this router will send the
packet to router (2, 3, 15) over the optical link. Router (2, 3, 15) will contend for the wireless
channel F23 with the three other routers (shown as G in Figure 2.4) in that group. Once
it has permission to use channel F23, the packet will be broadcast to all four routers
Figure 2.4: Intra-group and inter-group transmission on the 1024-core OWN architecture. The dotted lines represent wireless links whereas the solid lines represent optical links. Routers with the same letter share the same frequency channel [1] © 2015 IEEE.
(shown as H in Figure 2.4) of group 3 in the four different clusters. Only router (3, 2, 15)
at the destination cluster will accept the packet. It will then send the packet optically to
the destination router (3, 2, 11). This communication will also take three hops. Hence, for the
1024-core OWN architecture, the minimum hop count is one (optical, intra-cluster) and the
maximum hop count is three (optical-wireless-optical, inter-cluster). This lower diameter
of OWN contributes to lower energy and latency. Another underlying advantage of OWN
is scalability. In this architecture, I have reused the intra-group frequency. By restricting
the antenna beamwidth, inter-group horizontal and vertical wireless links can be reused
employing SDM [1]© 2011 IEEE.
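The hop-count rule above can be captured in a short routing sketch. This is a hedged illustration under stated assumptions: the function name is mine, cores are modeled as (g, cs, r, c) tuples as in the text, and the function counts the nominal optical-wireless-optical path without any corner-router special cases.

```python
# Hedged sketch of the OWN hop-count rule (an assumption-based model,
# not the thesis's simulator code).

def hop_count(src, dst):
    """Nominal OWN hop count between cores given as (g, cs, r, c) tuples."""
    g1, cs1, r1, _ = src
    g2, cs2, r2, _ = dst
    if (g1, cs1, r1) == (g2, cs2, r2):
        return 0                  # same tile: no network hop
    if (g1, cs1) == (g2, cs2):
        return 1                  # intra-cluster: one optical hop
    return 3                      # inter-cluster: optical, wireless, optical

# Examples from the text:
print(hop_count((2, 2, 0, 1), (2, 1, 13, 3)))   # intra-group case
print(hop_count((2, 3, 14, 3), (3, 2, 11, 1)))  # inter-group case
```

Both worked examples from this section evaluate to three hops, matching the stated network diameter.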
2.1.4 Deadlock Free Routing
Since OWN combines the optical and wireless technologies in the same architecture,
deadlocks are likely to occur due to the transition from one technology to another. Let
us consider Figure 2.5 (a). It shows four packets A, B, C, and D, where A and C are
intra-group packets and B and D are inter-group packets. Packet A originates at router (2, 2, 15),
takes the optical link to router (2, 2, 3), reaches intra-group wireless router (2, 3,
0), and then arrives at the destination router (2, 3, 15) via the optical link, where it exits the
network. Similarly, the travel path of packet C is: router (3, 2, 15) → optical link → router (3,
2, 3) → intra-group wireless link → router (3, 0, 15) → optical link → router (3, 0, 3). Inter-group
packet B originates at router (2, 3, 0), reaches router (2, 3, 15) via the optical link, takes the
inter-group horizontal wireless link to router (3, 2, 15), and then arrives at the destination router
(3, 2, 3) via the optical link, where it exits the network. Similarly, the travel path of the other
inter-group packet D is: router (3, 0, 15) → optical link → router (3, 0, 3) → inter-group horizontal
wireless link → router (2, 2, 15) → optical link → router (2, 2, 3). All the packets require three
hops to reach their respective destination routers from their source routers. Packets A and C
alone, or B and D alone, do not create any deadlock, but simultaneous transmission of A, B, C, and D
creates a circular dependency. Another case of deadlock, involving inter-group vertical
and horizontal wireless communication together with intra-group wireless communication, is shown
in Figure 2.5 (b) [1] © 2011 IEEE.
There are different types of deadlock avoidance techniques such as distance class or
dateline class [11]. To avoid deadlocks in OWN architecture, I have followed a form of
dateline class. Each router of OWN has four virtual channels (VCs) associated with each
input port. I restrict the VC allocation for each type of communication. Both intra-cluster
and intra-group transmissions use VC0 only. The remaining VCs (VC1, VC2, and VC3) are
assigned to flits requiring inter-group horizontal, vertical, and diagonal transmission,
respectively. These VC assignments are maintained throughout the lifetime of a packet in
the network. This proposed deadlock avoidance technique ensures that all packets reach
their intended destinations. However, due to this restricted VC allocation, input buffers will
Figure 2.5: Possible deadlock scenarios in a 1024-core OWN. Deadlock creation between groups using (a) inter-group horizontal wireless links and (b) inter-group horizontal and vertical wireless links [1] © 2015 IEEE.
not be utilized completely, which might contribute to increased latency and decreased
throughput [1] © 2011 IEEE.
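The restricted VC allocation above can be expressed as a small lookup function. This is a hedged sketch: the 2 × 2 group layout (groups 0 and 1 on top, 2 and 3 below) is an assumption I make for illustration, inferred from the figures but not spelled out in the text.

```python
# Hedged sketch of the dateline-style VC assignment (assumed 2x2 group grid).

def assign_vc(src_group, dst_group):
    """VC0: intra-cluster/intra-group; VC1/VC2/VC3: inter-group horizontal,
    vertical, and diagonal transmissions, held for the packet's lifetime."""
    if src_group == dst_group:
        return 0
    s_row, s_col = divmod(src_group, 2)
    d_row, d_col = divmod(dst_group, 2)
    if s_row == d_row:
        return 1          # horizontal neighbor group
    if s_col == d_col:
        return 2          # vertical neighbor group
    return 3              # diagonal group
```

Because the VC is fixed at injection and never changes, a packet on VC1 can never wait behind a packet on VC0, which breaks the circular dependencies illustrated in Figure 2.5.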
2.2 Technology for OWN: Wireless and Optical
In this section, I discuss the technological aspects to implement the proposed OWN
architecture. Except for the wireless and optical sections, bulk 45 nm LVT technology is used
for all other electrical components, such as metallic links and routers [1] © 2011 IEEE.
2.2.1 Wireless Technology
Although continuing progress in CMOS technology has made higher-frequency
operation in the mm-wave range possible, thereby reducing the antenna size to a scale suitable
for on-chip implementation, low gain due to low Si substrate resistivity is one of the
challenges of on-chip wireless communication [42]. In my design, a monopole antenna is
considered because monopole antennas radiate horizontally in all directions, as is necessary
for broadcasting or multicasting. Additionally, the monopole's ground plane separates
the substrate from the antenna and thus reduces the substrate's effects on the antenna and
enhances radiation efficiency. The antennas are fabricated at the topmost layer of the chip.
To enclose the chip, a nonmetallic ceramic cover can be used, which also helps with
thermal insulation and reduces multi-path and dispersion concerns [1] © 2011 IEEE.
In OWN architecture, each wireless channel has a bandwidth of 32 Gbps. Since there
are 16 wirelessly communicating pairs, 16 wireless channels are required. The distances
vary between different types of communicating antennas. As shown in Figure 2.3, the
intra-group antennas have the lowest distances while the inter-group-diagonal antennas
have the highest distances. Consequently, required transmission power can be varied in
accordance to the distance covered which allows reuse of a frequency channel on the same
chip without interference [5]. The maximum radiating distance between the intra-group
wireless transceivers is around 1.77 mm (assuming a router-to-router spacing of 1.25 mm with
0.625 mm spacing between the side cores and the edge of the chip). The minimum physical
distance between intra-group wireless routers located in two different groups is around 8.75
mm. Hence, the minimum separation between intra-group antennas of different groups is
almost five times the maximum radiating distance of an intra-group transmitter. Therefore,
a single frequency channel can be used for all the intra-group wireless communications.
Thus, F00, F11, F22, and F33 can be replaced by one wireless channel, for instance F0.
Due to this application of SDM in my design, the total number of wireless channels required
is reduced from 16 to 13. So, in total, approximately 416 Gbps of wireless bandwidth
is required, which is achievable [3]. For modulation, OOK is chosen due to its low power
consumption. As a result, each wireless link requires three transmitter-receiver pairs,
each transmitting at ≈10.7 Gbps [5] © 2011 IEEE.
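The channel and bandwidth figures above can be checked with simple arithmetic. The values come directly from the text; the variable names are my own illustrative choices.

```python
# Arithmetic check of the wireless bandwidth budget described above.

channels_without_sdm = 16     # 16 wirelessly communicating pairs
channels_with_sdm = 13        # four intra-group channels share one frequency
channel_rate_gbps = 32        # per-channel bandwidth

total_bandwidth_gbps = channels_with_sdm * channel_rate_gbps   # 416 Gbps

pairs_per_link = 3            # three OOK transmitter-receiver pairs per link
per_pair_gbps = channel_rate_gbps / pairs_per_link             # ~10.7 Gbps
```

This confirms the stated 416 Gbps aggregate and the ≈10.7 Gbps rate per transmitter-receiver pair.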
Today in many fabrication facilities, mm-wave circuits are already being implemented
at 65 nm or smaller CMOS technology nodes [43], [44], [45]. With the advances of CMOS
Table 2.1: Optical device parameters [1] © 2015 IEEE.

Parameter                      Value          Parameter                    Value
Waveguide Pitch                4 µm           Ring Resonator Diameter      12 µm
Wavelengths/Waveguide          64             Waveguide Loss               1.0 dB/cm
Pass-by Ring Resonator Loss    0.0001 dB      Photo-detector Loss          1.0 dB
Splitter Loss                  0.2 dB         Modulation Loss              1.0 dB
Demodulation Loss              1.0 dB         Receiver Sensitivity         -17 dBm
Laser Efficiency               15%            Ring Heating Power           26 µW/ring
Ring Modulating Power          500 µW/ring    Ring Modulation Frequency    10 GHz
technology and scaling, higher-frequency operation with lower power requirements may
be possible. Based on current fabrication trends, wireless link power efficiency could
possibly reach about 1 pJ/bit [37]. Moreover, the application of double-gate MOSFETs
(FinFETs) may lower the transistor threshold voltage, which will help reduce
the supply voltage and, as a result, power dissipation. Additionally, a three-fold power reduction
and lower losses in ultra-thin Si devices may be projected for RF wireless
transceivers built using 22 nm technology, thanks to smaller passives and improvements
in nano-materials and transistor off-currents. With this admittedly optimistic outlook, I
believe it is possible to reach and even drop below 1 pJ/bit energy efficiency for the wireless
links used in the OWN implementation [1] © 2011 IEEE.
2.2.2 Photonics Technology
Optical transmission requires the presence of optical waveguides and ring modulators.
Each waveguide can carry up to 64 wavelengths. My proposed OWN architecture
applies WDM to communicate via the optical waveguide. The modulators can modulate
the wavelengths at 10 Gbps using electro-modulation [46]. Since all the on-chip
components except the optical waveguide are electrical in nature, I need electrical-to-optical
and optical-to-electrical converters at both ends of the optical transmission line. The
electrical signal is converted to an optical signal by the ring modulators acting on the laser
light, and the optical signal is converted back to an electrical signal using photodetectors
and cascaded amplifiers. The technological parameters used in this thesis for optical links
are shown in Table 2.1 [1] © 2011 IEEE.
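The Table 2.1 parameters can be combined into a simple link budget. This is an illustrative sketch only: the path composition (an assumed 2 cm waveguide traversal and roughly 960 pass-by rings) and the -10 dBm launch power are my assumptions for demonstration, not figures from the thesis.

```python
# Illustrative optical link budget built from the Table 2.1 loss parameters
# (path length, ring count, and launch power are assumed values).

losses_db = {
    "waveguide":     1.0 * 2.0,      # 1.0 dB/cm over an assumed 2 cm path
    "pass_by_rings": 0.0001 * 960,   # 0.0001 dB per assumed pass-by ring
    "splitter":      0.2,
    "modulation":    1.0,
    "demodulation":  1.0,
    "photodetector": 1.0,
}
total_loss_db = sum(losses_db.values())           # roughly 5.3 dB

launch_power_dbm = -10.0                          # assumed per-wavelength power
received_dbm = launch_power_dbm - total_loss_db   # roughly -15.3 dBm
budget_closes = received_dbm >= -17.0             # Table 2.1 receiver sensitivity
```

Under these assumptions the received power stays above the -17 dBm receiver sensitivity, which is the kind of check the per-device losses in Table 2.1 are used for.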
2.3 Reconfigurable-OWN (R-OWN)
In this section, first, I briefly explain the 256-core OWN architecture. Second, I describe
the design of R-OWN for 256 cores and describe the wireless channel reconfiguration.
Third, I explain the routing mechanism of the 256-core R-OWN with examples. Fourth, I
analyze deadlock situations, especially when packets cross technology domains (optical to
wireless and wireless to optical), and describe a deadlock-free routing methodology.
2.3.1 256-Core OWN Architecture
Since there are four clusters in a 256-core OWN, twelve (4P2) unidirectional channels
are required to provide cluster-to-cluster wireless communication. Unique pairs of
frequency channels are assigned for communication between each pair of clusters. So,
each cluster needs three frequency channels to talk to the rest of the clusters (horizontal,
vertical, and diagonal clusters). As a result, each cluster contains three transmitters to send
packets to the horizontal, vertical, and diagonal clusters. Similarly, each cluster has three
receivers tuned to the transmitter frequencies of the other clusters to receive packets. Therefore,
each cluster will have three transceivers: one each for horizontal, vertical, and
diagonal cluster communication. The bandwidth of each wireless channel is assumed
to be 32 Gbps.
Figure 2.6: 256-core OWN architecture. Routers with the same color communicate with each other, and Fxy represents a wireless channel to send packets from cluster x to cluster y. For example, routers H0 and H1 communicate with each other over frequency channels F01 and F10 respectively, while routers V1 and V3 communicate with each other over frequency channels F13 and F31 respectively.
Three of the four corner routers of each cluster (Figure 2.1) are chosen for
on-chip wireless communication. The corner routers are chosen to provide maximum
separation between transceivers operating at different frequencies, minimizing inter-channel
interference. The innermost corner routers (marked with a red box in Figure 2.6)
of the 256-core OWN are not used, for the convenience of scaling to the 1024-core OWN (Figure
2.3) discussed in the previous section.
Figure 2.7: Left: Structure of the 256-core R-OWN architecture. Right: Structure of a wireless router in R-OWN with transmitters, receivers, counters, and local arbiter [8].
2.3.2 256-Core R-OWN Architecture
OWN 256-core architecture is extended to R-OWN architecture by incorporating
reconfigurability into the network. Each cluster of R-OWN is assigned an adaptive wireless
channel in addition to the fixed wireless channels present in the OWN 256-core network.
So, each wireless router of a cluster contains a transmitter tuned to the adaptive wireless
channel frequency assigned to that cluster and a receiver tuned to the adaptive wireless
channel frequency assigned to other clusters. However, only one of the wireless routers can
operate for a period of time to maintain signal integrity which is determined by an arbiter
(called a local arbiter) located inside the cluster. Therefore, a cluster contains three wireless
routers, three fixed transceiver antennas to communicate with the horizontal, vertical, and
diagonal clusters, three adaptive transceiver antennas, and an arbiter to control the adaptive
transceiver antennas. Since we require 16 channels with a total wireless bandwidth of
512 Gbps, the bandwidth of a wireless channel is 32 Gbps. The architecture of 256-core
R-OWN is shown in Figure 2.7.
The adaptive wireless channel of each cluster is reconfigured after a reconfiguration
window (set to 100 cycles in our simulation) depending on the number of packets sent to
the other clusters. After every 100 cycles, the local arbiter requests the wireless link
usage counts from the wireless routers. Upon receiving the request signal, each wireless router
of the cluster sends its wireless link utilization to the local arbiter. The local
arbiter determines the destination cluster of the adaptive wireless link for the next 100
cycles based on the maximum link utilization, resets its counter to zero, sends a decision
signal to each of the wireless routers of the cluster, and waits 100 cycles before sending
another request signal. Upon receiving the decision signal, a wireless router resets
its counter and turns on/off its adaptive antennas. Hence, each wireless router requires
a counter to keep track of the wireless link traversals; each cluster requires an arbiter to
configure the adaptive wireless link; and each arbiter requires a counter to count the number of
cycles.
Reconfigurable-Wireless Algorithm
Step 1 Wait for the reconfiguration window, RW
Step 2 Local arbiter, LAi requests the wireless routers (Hi, Vi, Di) for wireless link usage
(WLHi, WLVi, WLDi) where i is the cluster number
Step 3 Hi, Vi, and Di send WLHi, WLVi, and WLDi respectively to LAi
Step 4 LAi finds the maximum of [WLHi, WLVi, WLDi], resets its counter, and sends a
control packet to Hi, Vi, and Di
Step 5 Hi, Vi, and Di reset WLHi, WLVi, and WLDi respectively to zero and turn on/off their
adaptive antennas
Step 6 Go to Step 1
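The reconfiguration steps above can be sketched as a short Python simulation; the class structure and method names are illustrative only, since the thesis specifies the algorithm rather than an implementation:

```python
# Sketch of the R-OWN local arbiter reconfiguration (Steps 1-6).
# Class and method names are illustrative, not from the thesis.

RECONFIG_WINDOW = 100  # cycles, the window used in the thesis simulations


class WirelessRouter:
    def __init__(self, name):
        self.name = name
        self.link_usage = 0       # counter of wireless link traversals
        self.adaptive_on = False  # state of the adaptive antenna

    def record_traversal(self):
        self.link_usage += 1

    def apply_decision(self, selected):
        # Step 5: reset the usage counter and enable the adaptive antenna
        # only on the router the local arbiter selected.
        self.adaptive_on = (selected == self.name)
        self.link_usage = 0


class LocalArbiter:
    def __init__(self, routers):
        self.routers = routers  # [H_i, V_i, D_i]

    def reconfigure(self):
        # Steps 2-4: collect usage counts, pick the busiest link,
        # then broadcast the decision to the wireless routers.
        usages = {r.name: r.link_usage for r in self.routers}
        selected = max(usages, key=usages.get)
        for r in self.routers:
            r.apply_decision(selected)
        return selected


# Usage: the vertical link was busiest in the last window, so the
# adaptive channel is steered to the vertical wireless router.
h, v, d = WirelessRouter("H0"), WirelessRouter("V0"), WirelessRouter("D0")
for _ in range(7):
    v.record_traversal()
h.record_traversal()
arbiter = LocalArbiter([h, v, d])
assert arbiter.reconfigure() == "V0"
assert v.adaptive_on and not h.adaptive_on and v.link_usage == 0
```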
As shown in Figure 2.7, a local arbiter is connected to the wireless routers via metallic
links. In this thesis, I assumed a flit size of 64 bits with four flits in a packet. Hence, a
packet takes 8 cycles to transmit through the wireless link. Therefore, each wireless router
requires a 4-bit counter, and each arbiter requires a 7-bit counter. Since the size of the
counters and the width of the metallic links are small, the overhead is insignificant and
is thus ignored in the performance evaluation (Chapter 4).
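As a quick arithmetic check of these counter sizes, under the stated assumptions (64-bit flits, four flits per packet, 8 cycles per wireless packet, and a 100-cycle reconfiguration window):

```python
import math

FLIT_BITS = 64
FLITS_PER_PACKET = 4
CYCLES_PER_PACKET = 8   # one packet occupies the wireless link for 8 cycles
RECONFIG_WINDOW = 100   # cycles

# Maximum wireless link traversals possible within one window.
max_traversals = RECONFIG_WINDOW // CYCLES_PER_PACKET  # 12

# Width of the per-router usage counter and the arbiter's cycle counter.
usage_counter_bits = math.ceil(math.log2(max_traversals + 1))  # 4 bits
cycle_counter_bits = math.ceil(math.log2(RECONFIG_WINDOW))     # 7 bits

assert usage_counter_bits == 4 and cycle_counter_bits == 7
```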
2.3.3 Routing Mechanism of 256-Core R-OWN
There are four clusters in a 256-core R-OWN where each cluster contains 16 routers
and each router connects 4 cores. A core is represented by a three-part coordinate
(cs, r, c), where cs is the cluster number, r is the router number, and c is the core
number. Thus, the total number of cores in R-OWN is the product of the number of clusters,
routers per cluster, and cores per router (4 × 16 × 4 = 256), where 0 ≤ cs ≤ 3,
0 ≤ r ≤ 15, and 0 ≤ c ≤ 3. Since cores communicate through routers, I drop the core index
when identifying a router.
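The coordinate scheme can be illustrated with a small flattening function; the row-major ordering (cluster, then router, then core) is an assumption made for illustration:

```python
# Flattening and recovering the (cluster, router, core) coordinate used
# in 256-core R-OWN: 4 clusters x 16 routers x 4 cores.
CLUSTERS, ROUTERS, CORES = 4, 16, 4


def core_id(cs, r, c):
    """Map a (cluster, router, core) coordinate to a flat core index."""
    return (cs * ROUTERS + r) * CORES + c


def coordinate(core):
    """Inverse mapping: flat core index back to (cs, r, c)."""
    cs, rem = divmod(core, ROUTERS * CORES)
    r, c = divmod(rem, CORES)
    return cs, r, c


# The highest coordinate (3, 15, 3) maps to core 255 of 256.
assert core_id(3, 15, 3) == CLUSTERS * ROUTERS * CORES - 1
assert coordinate(core_id(1, 7, 3)) == (1, 7, 3)
```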
Consider the R-OWN communication shown in Figure 2.8. For example, core (0, 0,
0) and core (0, 7, 2) both want to send a packet to core (1, 7, 3), and router (0, 3) (H0)
possesses the adaptive wireless link of cluster 0. In other words, the adaptive wireless link
F0 is connected to cluster 1 at this point in time. Both the cores will need to send a packet
to the router H0 for inter-cluster wireless transmission. By modulating the wavelengths
associated with router H0, one of the cores will send a packet first, and then the next core
will send a packet. Assume both the packets are now sitting at the input buffers of router
H0. Since two wireless links (one fixed, F01 and one adaptive, F0) are now connected to the
wireless router H1 of cluster 1, these two packets will be sent concurrently using frequency
channels F01 and F0. In the same reconfiguration time frame, suppose two cores
of cluster 0 want to send packets to cluster 2, which requires the use of the vertical wireless
link (F02). Since only one wireless link is connected to cluster 2 from cluster 0, both
the packets will contend for F02 at router V0 and packets will be transmitted serially. In
contrast, say the adaptive wireless link of cluster 1 (F1) is pointing to cluster 3 as shown
Figure 2.8: Communication mechanism of 256-core R-OWN. The large dotted line represents a fixed wireless link, the small dotted line represents an adaptive wireless link, and the solid line represents an optical link. Routers of the same color talk to each other [8].
in Figure 2.8. Hence, core (1, 13, 2) and core (1, 11, 1) both will be able to send packets
at the same time, using the fixed wireless channel F13 and the adaptive wireless channel F1, to
their destination cluster 3 once the packets reach the wireless router V1. This is possible,
because each cluster has its own adaptive wireless link which is configured based on the
outgoing traffic from this cluster only. Now, consider that core (1, 13, 2) sends its packet first
to destination core (3, 7, 1) using wireless link F13. Then if core (1, 11, 1) wants to talk to
core (3, 0, 3), router V1 will use the wireless channel F1 instead of F13 as F13 was used
last time. I chose to send packets using the adaptive and fixed wireless links alternately to
minimize contention. However, when a wireless router does not have access to the adaptive
wireless link, I use the dedicated wireless link to communicate with the other clusters.
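The alternating selection between the fixed and adaptive channels described above can be sketched as follows; the channel names and the fallback rule are illustrative:

```python
# Sketch of alternating fixed/adaptive channel selection at a wireless
# router in R-OWN. Names are illustrative, not from the thesis.
class ChannelSelector:
    def __init__(self, fixed, adaptive):
        self.channels = [fixed, adaptive]
        self.last = 1  # so the fixed channel is tried first

    def next_channel(self, adaptive_available):
        # When the cluster's adaptive link points elsewhere, fall back
        # to the dedicated fixed channel for every packet.
        if not adaptive_available:
            return self.channels[0]
        self.last ^= 1  # alternate between fixed (0) and adaptive (1)
        return self.channels[self.last]


sel = ChannelSelector("F13", "F1")
# With the adaptive link pointed at this destination, packets alternate:
assert [sel.next_channel(True) for _ in range(4)] == ["F13", "F1", "F13", "F1"]
# Without it, every packet uses the fixed channel:
assert sel.next_channel(False) == "F13"
```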
Figure 2.9: (a) Possible deadlock scenario in a 256-core R-OWN for simultaneous transmission of inter-cluster packets A, B, and C. (b) Proposed network with inclusion of new optical links to avoid deadlocks. A packet is marked with the color of the channel it is using [8].
2.3.4 Deadlock Free Routing
Since R-OWN requires optical-to-wireless-to-optical domain transitions, a cyclic
dependency exists between the channels, which may create deadlock. This is shown in
Figure 2.9 (a) for three packets A, B, and C. The travel paths of packets A, B, and C are D0-
H0-H1-V1, H1-V1-V3-D3, and V3-D3-D0-H0 respectively. Because packets A and C,
B and A, and C and B each share an optical link, deadlock may occur. There are different
techniques to avoid deadlocks. For R-OWN, I have provided additional channels with
usage restrictions to avoid deadlocks and improve buffer utilization compared to OWN.
I assign new optical links for inter-cluster packets from the source router to the wireless
router. However, in the destination cluster, packets use the optical links that were present
before. As a result, for example, packet A and C take different optical links to travel from
router D0 to router H0, which breaks the cyclic dependency. As shown in Figure 2.9 (b),
the proposed network is deadlock-free, which ensures the delivery of all packets. Since an optical
waveguide can carry a maximum of 64 wavelengths, these additional optical
links can be inserted into the existing data waveguide; the tradeoff is increased optical power consumption.
3 Off-Chip Interconnection Network
In this chapter, I propose to use wireless technology for both on-chip and off-chip
communications by exploring the design space of combined wireless and metallic
interconnects. Due to the pin bandwidth
limitation, the number of memory controllers used to access the off-chip memory (DRAM)
does not increase proportionally with the number of cores [47]. In a traditional mesh-based
NoC architecture, the memory controllers are connected only at the corner routers due
to these pin restrictions. Therefore, as core count increases, packets would require more
hops to access off-chip memory which would contribute to an increase in latency and
energy consumption. For example, with private L1 and shared L2 caches, the on-chip
communication delay, which comprises the request packet delay from L1 to L2 and from L2
to memory and the response packet delay from memory to L2 and from L2 to L1, is significant
[48]. Moreover, the off-chip metallic link connecting the memory controller to the DRAM
cannot be traversed in a single cycle [49]. This would incur additional delay for off-chip
memory accesses.
The problem of longer off-chip memory access latency can be addressed in two
potential ways: (1) by reducing the processing core to the memory controller (request
message) latency and the memory controller to the processing core (response message)
latency, and/or (2) reducing the link traversal latency that connects the memory controller to
the DRAM. Since connecting all the cores directly to the memory controllers using metallic
interconnects is not convenient, positioning the memory controllers carefully on the chip
would dramatically improve the delay scenario [47]. However, this would only partially
solve the problem because the processing cores further away from the memory controller
will still see significant latency. Moreover, on-chip memory controller placement will not
reduce off-chip link traversal latency. Therefore, I propose to use wireless technology
for on-chip as well as off-chip communication to improve both the latency and energy-
Table 3.1: Naming convention of the baseline and proposed architectures [2].
General Name Format*: (On-chip)-(Off-chip)-(Antenna Type)-(Bandwidth)
"M" stands for Metallic link
"W" stands for Wireless link
"D" stands for Directional Antenna
"O" stands for Omnidirectional Antenna
"A" stands for Aggressive assumption for wireless BW (512 Gbps)
"C" stands for Conservative assumption for wireless BW (128 Gbps)
*"Antenna Type" and "Bandwidth (BW)" apply only to wireless networks
efficiency. If wireless technology is used for off-chip communications alone, I use FDM for
transmission between a memory controller and a DRAM. If wireless technology is used for
on-chip communication alone, I use FDM and TDM for on-chip wireless communications
between the routers and the memory controllers. If wireless technology is used for both on-
chip and off-chip communications, I use FDM, TDM, and SDM for wireless transmission.
The end result is that I can provide a maximum of two hops for any router-to-memory-controller
communication.
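A compact way to summarize which multiplexing techniques apply to each design point is shown below; the tuple keys are illustrative labels, not names used in the thesis:

```python
# Mapping from where wireless links are used (on-chip, off-chip) to the
# multiplexing techniques employed, summarizing the scheme above.
MULTIPLEXING = {
    ("metallic", "wireless"): {"FDM"},                # off-chip wireless only
    ("wireless", "metallic"): {"FDM", "TDM"},         # on-chip wireless only
    ("wireless", "wireless"): {"FDM", "TDM", "SDM"},  # wireless on both sides
}

# SDM is needed only when wireless is used on both sides of the pins.
assert MULTIPLEXING[("wireless", "wireless")] == {"FDM", "TDM", "SDM"}
```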
This chapter is organized as follows: first, I describe the proposed on-chip and off-chip
hybrid-wireless architectures. Next, I explain the communication protocols of the proposed
architectures with examples.
3.1 On-Chip and Off-Chip Wireless Architecture
In this chapter, all the proposed and baseline architectures are 16-core tile-based
architectures where each tile contains a processing core, two caches, and a router (NIC).
The first level cache (L1) is private to the core and the last level cache (L2) is distributed
among the cores. Each router is connected to the caches via input and output ports, neighbor
Figure 3.1: General structure of baseline and proposed off-chip wireless architectures. (a) Baseline architecture with both on-chip and off-chip metallic interconnects. (b) Metallic interconnects for on-chip and wireless interconnects for off-chip communication [2].
routers, a processing core, and memory controllers. The memory controllers are considered
as a switch that can arbitrate between multiple memory requests [47]. The naming
convention of the architectures used in this chapter is given in Table 3.1. For example,
consider the architecture M-M-X-X. Both the first “M” (on-chip) and the second “M” (off-
chip) suggest that the network links are metallic. Because the metallic interconnects are not
constrained in terms of bandwidth and cannot be categorized in different types, the last two
parts are written as “X” (don’t care). The name W-M-O-A indicates that the architecture
uses wireless interconnects for on-chip communication and metallic interconnects for off-
chip communication. The last two letters state that the antenna used for the on-chip wireless
network is omnidirectional and that the overall bandwidth is 512 Gbps (shown in
Table 3.1). Similarly, W-W-D-C indicates that both the on-chip and off-chip networks employ
wireless technology for communication using directional antennas with an overall bandwidth
of 128 Gbps.
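A small decoder for this naming convention can make the fields explicit; the dictionary strings paraphrase Table 3.1 and are not part of the thesis:

```python
# Decoder for the architecture naming convention of Table 3.1.
# "X" fields are don't-cares for metallic-only networks.
ONCHIP = {"M": "metallic", "W": "wireless"}
OFFCHIP = ONCHIP
ANTENNA = {"D": "directional", "O": "omnidirectional", "X": None}
BANDWIDTH = {"A": "512 Gbps (aggressive)", "C": "128 Gbps (conservative)", "X": None}


def decode(name):
    """Split a name such as 'W-M-D-A' into its four labeled fields."""
    on, off, ant, bw = name.split("-")
    return {
        "on_chip": ONCHIP[on],
        "off_chip": OFFCHIP[off],
        "antenna": ANTENNA[ant],
        "wireless_bw": BANDWIDTH[bw],
    }


assert decode("M-M-X-X")["antenna"] is None
assert decode("W-W-D-C") == {
    "on_chip": "wireless", "off_chip": "wireless",
    "antenna": "directional", "wireless_bw": "128 Gbps (conservative)",
}
```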
3.1.1 Metallic Interconnects (M-M-X-X)
The architecture of M-M-X-X is shown in Figure 3.1 (a). It is used as the baseline
architecture to compare the performance of the proposed architectures. The router-to-
router distance is considered as 5 mm, the shortest router-to-memory controller distance
is 5 mm [47] while the longest router-to-memory controller distance is considered as 10
mm, and the trace length is 50 mm (2 inch) for DDR3 technology [41]. I have placed
the memory controllers at the edge of the chip to provide maximum connectivity between
the memory controllers and the routers using metallic links. The tradeoff is lower link
and router contention with longer links that require higher energy and latency. I have also
assumed distributed off-chip memory where each memory module is serviced by a specific
memory controller.
3.1.2 Hybrid Wireless Interconnect (W/M-W/M-X-X)
On top of the baseline architecture, M-M-X-X, the hybrid wireless architectures are built by
inserting wireless links for on-chip and/or off-chip communications. On-chip wireless links
are used to transfer messages to and from the memory controllers, and off-chip wireless
links replace the traditional metallic links that connect the memory controller to the DRAM.
Wireless bandwidth is determined by the technology and the antenna used, and is not the
same for all the proposed architectures. Different types of hybrid wireless architectures
proposed in this thesis are discussed below.
3.1.2.1 On-Chip Hybrid Wireless Interconnect (W-M-X-X)
The routers of the on-chip hybrid wireless interconnect use wireless technology to send
request messages to and receive response messages from the distant memory controllers.
However, the traditional metallic links are used for all the router-to-router and nearby
router-to-memory controller communications. One such general architecture is shown
Figure 3.2: General structure of proposed on-chip and off-chip wireless architectures. (a) Wireless interconnects for on-chip and metallic interconnects for off-chip communication. (b) Wireless interconnects for both on-chip and off-chip communication [2].
in Figure 3.2 (a). The on-chip routers are divided into four groups where each group
contains four routers. Each group is assigned a unique frequency channel to transmit
messages to the distant memory controllers while metallic links are used for nearby
memory controllers. Similarly, each memory controller is assigned a unique frequency
channel to transmit data to the distant router-groups while it uses metallic links for nearby
router-groups. I have considered two types of antennas (omnidirectional and directional)
and two wireless bandwidth assumptions (conservative and aggressive). This provides four
different architecture designs. Nevertheless, I have not considered W-M-O-C because
a wireless bandwidth of 512 Gbps (the aggressive assumption) for omnidirectional antennas
is well established [1, 3]. The other architectures considered are described below:
• W-M-O-A: As shown in Figure 3.2 (a), the routers of a group share the frequency
channel assigned to the group for sending messages to the memory controllers. For
example, group G0 is assigned a frequency channel to send a message to the memory
controllers MC1 and MC3. The routers (R0, R1, R4, and R5) of G0 share the
frequency channel using a token to maintain signal integrity. Since an omnidirectional
antenna is used for wireless communication, both MC1 and MC3 receive the data
at the same time and then discard the message if it is not destined for them. Similarly,
memory controller MC1 uses a frequency channel to send data to the groups G0
and G2 (R8, R9, R12, and R13). Therefore, each router of a group contains one
transmitter to send data to the distant memory controllers and two receivers to receive
data from the distant memory controllers. Each memory controller also contains a
transmitter to send data to the distant router groups and two receivers to receive data
from the distant router groups.
• W-M-D-A: The basic architecture of W-M-D-A is similar to the W-M-O-A
architecture. However, two antennas are required to send data because the antenna
used for wireless communication is a directional type. For example, router R0 of
group G0 contains two transmitters: one for sending data to memory controller MC1
and the other to MC3. When router R0 has the token to transmit, it uses one of the two
transmitters depending on the destination memory controller. Similarly, the memory
controller, for example MC1, uses two transmitters to send data to the routers of
groups G0 and G2. Both these transmitters of a router or a memory controller are
tuned at the same frequency. Although W-M-D-A requires twice as many transmitters as
W-M-O-A, the number of receivers is the same in both.
• W-M-D-C: The structure of W-M-D-C is the same as W-M-D-A architecture. The
only difference is the wireless bandwidth used. The wireless link bandwidth of W-
M-D-C is one fourth of the wireless link bandwidth of W-M-D-A. Hence, the latency
for W-M-D-C would be higher than for W-M-D-A.
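A back-of-the-envelope serialization comparison for one 256-bit packet (4 flits × 64 bits) makes this bandwidth tradeoff concrete; this sketch ignores token arbitration and propagation delay:

```python
# Serialization latency of one packet over the two bandwidth assumptions.
PACKET_BITS = 4 * 64  # 4 flits x 64 bits


def serialization_ns(bandwidth_gbps):
    # 1 Gbps transfers exactly 1 bit per nanosecond.
    return PACKET_BITS / bandwidth_gbps


aggressive = serialization_ns(512)    # W-M-D-A: 0.5 ns per packet
conservative = serialization_ns(128)  # W-M-D-C: 2.0 ns per packet

# One-fourth the bandwidth means four times the serialization time.
assert conservative == 4 * aggressive
```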
3.1.2.2 Off-Chip Hybrid Wireless Interconnect (M-W-X-X)
M-W-X-X has the same on-chip architecture as M-M-X-X architecture. However, M-
W-X-X employs wireless links to communicate with the off-chip memory as shown in
Figure 3.1 (b). For this purpose, each memory controller contains a transmitter and a
receiver that is tuned to the frequency of the corresponding DRAM's transmitter. Likewise, the
DRAM must include a transmitter and a receiver that is tuned to the frequency
of the corresponding memory controller's transmitter.
3.1.2.3 On-Chip and Off-Chip Hybrid Wireless Interconnect (W-W-X-X)
This architecture combines the on-chip architecture W-M-X-X and off-chip architecture
M-W-X-X. Since both the on-chip and off-chip networks use wireless technology, I use
SDM technique to overcome the frequency bandwidth limitation. One such architecture is
shown in Figure 3.2 (b). A summary of all the architectures described previously is shown
in Table 3.2.
Table 3.2: Summary of the baseline and proposed architectures [2].
M-M-X-X Metallic on-chip interconnects, and metallic off-chip interconnects (link
BW 128 Gbps)
W-M-O-A Hybrid wireless on-chip interconnects with omnidirectional antenna, metal-
lic off-chip interconnects (link BW 128 Gbps), and total on-chip wireless
bandwidth is 512 Gbps
W-M-D-C Hybrid wireless on-chip interconnects with directional antenna, metallic
off-chip interconnects (link BW 128 Gbps), and total on-chip wireless
bandwidth is 128 Gbps
W-M-D-A Hybrid wireless on-chip interconnects with directional antenna, metallic
off-chip interconnects (link BW 128 Gbps), and total on-chip wireless
bandwidth is 512 Gbps
M-W-O-A Metallic on-chip interconnects, off-chip wireless interconnects (link BW 64
Gbps) with omnidirectional antenna, and total off-chip wireless bandwidth
is 512 Gbps
W-W-D-C Hybrid wireless on-chip interconnects with directional antenna, total on-
chip wireless bandwidth is 128 Gbps, off-chip wireless interconnects (link
BW 32 Gbps) with directional antenna, and total off-chip bandwidth is 128
Gbps employing SDM
W-W-D-A Hybrid wireless on-chip interconnects with directional antenna, total on-
chip wireless bandwidth is 512 Gbps, off-chip wireless interconnects (link
BW 128 Gbps) with directional antenna, and total off-chip bandwidth is 512
Gbps employing SDM
3.2 Communication Protocol: Metallic and Hybrid Wireless Interconnect
In this thesis, I assume that each processing core requests necessary data from its private
L1 cache. If there is an L1 miss, then a request message is sent through the router to
the L2 cache containing the necessary data. On an L2 miss, a request message is sent
to the memory controller that is servicing the memory module containing the latest data.
After performing the read operation, a DRAM sends the data to the memory controller
that requested the data. Since memories are inclusive, a response message carrying the
data is sent to the requesting router's L2 cache, and then this router sends the data to
the source router's L1 cache. This is the basic communication protocol followed in this
chapter. Following are the architecture specific communication mechanisms. Since an off-
chip wireless or metallic link transmission is identical in terms of communication protocol,
I only focus on on-chip communication in this section.
3.2.1 On-Chip Metallic and Off-Chip Metallic or Wireless Interconnects
Figure 3.3 (a) shows the communication mechanism for an on-chip metallic link based
architecture where the off-chip messages are sent via wireless or metallic links. For
example, if there is a miss at the L1 cache connected to router R0 and this address space is
serviced by the L2 cache connected to router R9, then the L1 cache needs to send a request
message through R0 to the L2 cache via R9. The request message follows the dimension-order routing (DOR) protocol
to reach R9 from R0. If the L2 cache has the updated data, a response message is sent to
R0. However, if there is an L2 miss, then router R9 sends a new request message to the
memory controller servicing that address space. Consider that the memory controller MC3
is servicing the address space of the L2 cache connected to router R9. Hence, R9 sends
a message requesting updated data to MC3, and the message utilizes the DOR protocol
to reach MC3. MC3 sends the necessary signal to the memory module to perform the
read operation either using the metallic link or the wireless link. Upon receiving the data
Figure 3.3: Communication mechanism of the proposed architectures for both on-chip and off-chip metallic and wireless interconnects. (a) On-chip metallic and off-chip metallic or wireless interconnects. (b) On-chip wireless interconnects with omnidirectional antenna and off-chip metallic interconnects. (c) On-chip wireless interconnects with directional antenna and off-chip metallic or wireless interconnects [2].
from the memory module, MC3 sends a response message to the router R9. The L2 cache
connected to router R9 updates the cache, and R9 sends a new response message to router
R0. These response messages also follow the DOR protocol. The whole communication
takes twelve hops: three hops (R0 to R9), two hops (R9 to MC3), two hops (MC3 to DRAM
to MC3), two hops (MC3 to R9), and three hops (R9 to R0).
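The on-chip portion of this hop count can be reproduced under XY dimension-order routing, assuming row-major router numbering R0 through R15 in the 4 × 4 mesh; the router-to-memory-controller and off-chip hop counts are taken directly from the text, since the controllers sit outside the mesh coordinates:

```python
# Hop count under XY dimension-order routing (DOR) in the 4x4 mesh.
# Routers are assumed numbered row-major, R0..R15.
MESH = 4


def dor_hops(src, dst):
    """Manhattan distance between two routers; DOR traverses exactly this."""
    sx, sy = src % MESH, src // MESH
    dx, dy = dst % MESH, dst // MESH
    return abs(sx - dx) + abs(sy - dy)


# The example transaction: R0 -> R9 (L2) -> MC3 -> DRAM -> MC3 -> R9 -> R0.
on_chip = 2 * dor_hops(0, 9)  # request and response between R0 and R9
mc = 2 * 2                    # R9 <-> MC3, two hops each way (per the text)
off_chip = 2                  # MC3 -> DRAM -> MC3
assert on_chip + mc + off_chip == 12
```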
3.2.2 On-Chip Wireless Interconnects With Omnidirectional Antenna and Off-Chip
Metallic Interconnects
Figure 3.3 (b) shows the communication mechanism for an on-chip wireless link based
architecture where the off-chip messages are sent via wireless or metallic links. For
example, suppose there is a miss at the L1 cache connected to router R0 and the corresponding
address space is serviced by the L2 cache connected to router R5. Then R0 sends a request
message to R5 requesting the data. The request message uses the metallic links following
the DOR protocol. If the L2 cache has the updated data, a response message is sent back to
R0. Consider that there is an L2 miss at router R5 and that the memory controller MC1 is servicing
the corresponding address space. Since the transmitter of R5 and the receiver of MC1 are
tuned to the same frequency, R5 waits for the token to send a new request message to MC1
using the wireless link. When R5 has the right to transmit using the wireless link of group
G0, it broadcasts the request message which is received by both memory controllers MC1
and MC3. MC1 accepts the message while MC3 discards it. MC1 collects the data from
the memory module it is connected with via the present off-chip link (wireless/metallic).
MC1 then broadcasts the response message containing the data to the routers of group G0
and G2. Only R5 accepts the message and sends a new response message to router R0. The
new response message follows the DOR protocol. The whole communication takes eight hops:
two hops (R0 to R5), one hop (R5 to MC1), two hops (MC1 to DRAM to MC1), one hop
(MC1 to R5), and two hops (R5 to R0). Therefore, the number of hops required to access
the off-chip memory is reduced. The drawback of this communication mechanism is that
router R0 discards the broadcast message containing the necessary data, which requires R5
to send the data again.
3.2.3 On-Chip Wireless Interconnects With Directional Antenna and Off-Chip
Metallic Interconnects
The basic communication mechanism of on-chip wireless interconnects with directional
antennas is similar to the on-chip wireless interconnects with omnidirectional antennas.
Consider the situation described in the previous sub-section. The only difference is that
R5 contains two transmitters to talk to MC1 and MC3. Hence, when R5 has the right to
transmit, it sends the message using the transmitter pointed towards MC1, and MC3 does
not receive any message. Similarly, when MC1 sends the response message, it uses the
transmitter that is pointed towards group G0. The number of hops required in this case
is also eight, following the same sequence. The communication mechanism of this
architecture is shown in Figure 3.3 (c).
4 Evaluation of the Proposed Architectures
In this chapter, I analyze the performance of the proposed architectures (OWN, R-OWN,
and the on-chip and off-chip wireless network) by comparing them against state-of-the-art
wired, wireless, and photonic architectures. I restrict my focus to area, energy
per bit, latency, and saturation throughput because these are the most critical
parameters of an interconnection network.
The area of an architecture is calculated as a sum of the link (wired, wireless, and
optical) area, router area, wireless transceiver area (wireless networks), and waveguide
area (photonic networks) [1]. I have used Dsent v. 0.91 [49] to calculate the area and
the energy of the wired links and routers for a bulk 45nm LVT technology [1]. For a
wireless link, I have assumed the transmitter area as 0.42 mm2 and the receiver area as
0.20 mm2 [38]. Photonic link area consists of the power, data, and arbitration waveguide
area. To calculate the wired/wireless link energy consumption, I have multiplied the
number of wired/wireless link traversals, collected from the cycle-accurate simulation, by
the corresponding wired/wireless link energy [1]. For all the architectures, wireless link
energy efficiency is assumed to be 1 pJ/bit for on-chip communication and, considering a
linear increase, is estimated for off-chip communication [37]. I have assumed a fixed
1 pJ-per-bit energy consumption for all the on-chip wireless architectures [1]. To calculate
the optical link energy consumption, I have considered the worst case scenario and used
the values of the parameters shown in Table 2.1 [1]. When calculating the router energy
consumption, because Dsent gives the total buffer and crossbar power, I have divided the
buffer energy by the number of buffers and the crossbar energy by the radix of
the router [1]. For a fair comparison between different topologies, I have kept the
bisection bandwidth and the clock period of the network the same for all the architectures
during simulation. For fairness, I have also kept the same number of VCs and buffers for all the
architectures [1].
Figure 4.1: Layout area comparison between different topologies [1] © 2015 IEEE.
4.1 Performance Evaluation of OWN
To evaluate the performance of the proposed architecture OWN, I compared OWN with
CMesh [28], WCube [3], and ATAC [7] architectures. To simulate network performance
for different types of synthetic traffic patterns such as uniform (UN), bit-reversal (BR),
complement (COMP), matrix transpose (MT), perfect shuffle (PS), and neighbor (NBR), I
have used a cycle accurate simulator [50]. In the case of ATAC and OWN, the architectures
are not completely symmetric. For fairness, I believe that, when calculating the overall
bisection bandwidth of an architecture, the bisection bandwidth of the wired links for ATAC
and the bisection bandwidth of the optical links for OWN should also be considered [1] ©
2011 IEEE.
4.1.1 Area Estimate
As shown in Figure 4.1, ATAC requires the highest area, which is 35% higher than
OWN; whereas WCube and CMesh require 34% and 66% less area, respectively, compared
to OWN. CMesh and OWN both have 256 routers with a core concentration of 4; ATAC
has 1024 routers with a core concentration of 1; and WCube has 256 routers with a core
concentration of 4 as well as 16 wireless routers connected with 4 other non-wireless
routers. The main reason ATAC's area is the highest is its use of a very large number
of routers. Another factor contributing to the large router area of ATAC can be the high
radix of the hubs. To calculate the area of ATAC, instead of calculating the hub area for a
67 × 2 radix, I have split the switch into two switches of radix 4 × 1 and 63 × 1, and then added the
corresponding areas. Although WCube has more routers in total than OWN,
OWN requires 4× as many transmitter antennas as WCube. Because of this, OWN
requires more area than WCube. Since photonic link area is larger than traditional
wired link area, it has contributed to the area increase of ATAC and OWN
compared to CMesh and WCube, as indicated in Figure 4.1 [1] © 2011 IEEE.
4.1.2 Energy Estimate
To calculate the wired link energy of ATAC, since the receiver hub broadcasts the flits
to all the cores under that hub, I have multiplied the energy consumption of a
hub-to-core link by 16. For OWN, I have included the arbitration waveguide energy consumption
which is not considered for ATAC. WCube is an extension of CMesh and uses wireless
links to transmit packets that would otherwise require many wired hops. During simulation, to provide the
best performance, I have optimized the threshold distance for using the wireless link instead
of wired link. I have counted the number of wired and wireless hops required for each pair
of source and destination cores and varied the difference between them to find out the best
position to take the wireless link [1] © 2011 IEEE.
Figure 4.2 shows the energy per bit comparison for uniform and perfect shuffle traffic
patterns. (Other patterns have been omitted due to space restrictions). For both of these
cases, WCube consumes less wire link energy because it uses wireless links for distant
Figure 4.2: Energy comparison between different topologies. (Top) Energy per bit for uniform traffic pattern, and (Bottom) energy per bit for perfect shuffle traffic pattern [1] © 2015 IEEE.
transmission. Thus, CMesh has higher wire link energy than WCube. Since ATAC uses
wired mesh network from the source router to the hub and broadcasts at the receiving
end, wire link energy consumption is higher for ATAC. OWN consumes the lowest router
energy. This is due to the lower radix of the split router and also to the fact that OWN requires
only three hops. Furthermore, because a single higher-radix router traversal consumes less energy
than multiple router traversals [51], the energy per bit requirement
of OWN is reduced. WCube not only has a higher number of routers but also the radix
of some routers is higher compared to CMesh. ATAC has the highest number of routers
among the four, but ATAC still consumes less router energy than WCube. This is because
WCube shares a single router among 64 cores whereas ATAC shares a router among only
16 cores. WCube has a lower wireless link energy requirement than OWN since WCube
employs wireless links only for distant packets. In contrast, OWN uses wireless links for all
the inter-cluster transmission whether the clusters are neighbors or not. Figure 4.2 shows
that for uniform traffic, OWN consumes 23% higher energy/bit than ATAC and 40% less
energy/bit than WCube; and for perfect shuffle traffic, OWN consumes only 3% higher
energy/bit than ATAC and 21% lower energy/bit than WCube. The energy overhead of
OWN is mostly caused by wireless link energy as can be seen in Figure 4.2. The reduction
of energy per bit of WCube from uniform to perfect shuffle traffic is due to the lower use
of wireless links, which is also true for OWN. However, the wireless link energy per bit
requirement is technology dependent. As wireless technology advances and the wireless
link energy per bit decreases, OWN will benefit more than the other architectures in terms
of energy consumption [1] © 2015 IEEE.
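The energy-per-bit comparison above amounts to summing the per-component energies accumulated along a packet's path. A minimal sketch of this accounting is shown below; the per-component values and path compositions are illustrative placeholders, not the measured numbers behind Figure 4.2.

```python
# Sketch of per-path energy-per-bit accounting.
# The per-component energies (pJ/bit) below are illustrative
# placeholders, NOT measured values from this thesis.

def energy_per_bit(path, energy_pj):
    """Sum the per-bit energy (pJ) of every router and link a
    packet traverses along its path."""
    return sum(energy_pj[component] for component in path)

# Hypothetical per-bit energies for each component type.
energy_pj = {"router": 0.50, "wire": 0.25, "wireless": 1.10}

# OWN-style path: source router -> wireless hop -> destination router.
own_path = ["router", "wireless", "router"]
# Mesh-style path: several wired router-to-router traversals.
mesh_path = ["router", "wire"] * 5 + ["router"]

print(energy_per_bit(own_path, energy_pj))   # one costly wireless hop
print(energy_per_bit(mesh_path, energy_pj))  # many cheap wired hops add up
```

Even with a more expensive wireless hop, the shorter path can win overall, which is the trade-off the figure illustrates.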
4.1.3 Saturation Throughput and Latency Comparison
In this sub-section, I briefly discuss the latency and saturation throughput of OWN
compared to CMesh, WCube, and ATAC. To imitate ATAC as closely as possible, I have
subtracted the buffer and crossbar delay for the flits travelling from the destination hub to
the cores to represent the broadcast scheme. Figure 4.3 shows the latency, in cycles, for the
traffic types UN, BR, MT, and NBR under varied
Figure 4.3: Latency shown as Network Load vs. Number of Cycles for various types of synthetic traffic. (Top-left) Uniform, (Top-right) Bit-reversal, (Bottom-left) Matrix transpose and (Bottom-right) Neighbor [1] © 2015 IEEE.
network load. For the uniform and bit-reversal traffic shown in Figure 4.3 (top-left and top-
right), OWN performs the best. This is because OWN requires only three hops to transmit
to any part of the network. ATAC requires a higher number of hops than OWN but less
than CMesh and WCube. Since WCube uses wireless links for distant source-destination
pairs, it performs better than CMesh. For matrix transpose traffic, ATAC performs best;
whereas, for neighbor traffic, OWN shows the worst performance as shown in Figure 4.3
(bottom-left and bottom-right). In the case of neighbor traffic, the source and destination
cores are close to each other and this is why CMesh and WCube perform better than the
rest. Since OWN requires a token every time a packet is sent, its performance is affected
[1] © 2015 IEEE.
ATAC shares a hub among 16 routers, which are connected using a wired mesh topology.
Hence, the packets only need to wait to use the global optical channel, and the received
Figure 4.4: Saturation throughput for various types of synthetic traffic patterns [1] © 2015 IEEE.
packets are broadcast to all the hubs. For matrix transpose traffic, the source row and
column are interchanged to form the destination. Since OWN requires a token for every
transmission, which ATAC does not, ATAC performs better than OWN. Figure 4.4 shows the
saturation throughput for various synthetic traffic types, where GM represents the geometric
mean. Although ATAC has the highest saturation throughput, OWN outperforms WCube
and CMesh by 8% and 28% respectively [1] © 2015 IEEE.
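The synthetic traffic patterns evaluated above map each source core to a destination by a fixed rule. The sketch below gives the conventional textbook definitions for a 256-core network with a 16x16 core layout; these are assumed to match the simulator's implementations, not taken verbatim from it.

```python
# Conventional definitions of the synthetic traffic patterns for a
# 256-core network (8-bit core IDs, 16x16 layout). Assumed to match
# the simulator's implementations; shown here for illustration only.

BITS = 8   # log2(256) address bits
SIDE = 16  # cores per row in a 16x16 layout

def bit_reversal(src):
    """Destination is the source ID with its bit order reversed."""
    return int(format(src, f"0{BITS}b")[::-1], 2)

def matrix_transpose(src):
    """Interchange the source row and column to form the destination."""
    row, col = divmod(src, SIDE)
    return col * SIDE + row

def neighbor(src):
    """Destination is the adjacent core in the same row (wrap-around)."""
    row, col = divmod(src, SIDE)
    return row * SIDE + (col + 1) % SIDE

def perfect_shuffle(src):
    """Rotate the source address left by one bit."""
    return ((src << 1) | (src >> (BITS - 1))) & (2**BITS - 1)
```

These rules explain the results above: neighbor traffic keeps source and destination physically close (favoring CMesh and WCube), while bit-reversal and transpose tend to pair distant cores (favoring the low-diameter OWN).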
4.2 Performance Evaluation of R-OWN
To evaluate the performance of the proposed R-OWN architecture, I have compared
the 256-core OWN and R-OWN architectures with CMesh [28], WCube [3], and Opt-Xbar
architectures. Opt-Xbar is a hypothetical 256-core photonic crossbar architecture with a
snakelike waveguide. It contains 64 routers with a concentration of four cores and uses
Multiple-Writer Single-Reader (MWSR) as the arbitration technique. Each router is
assigned a unique wavelength (or set of wavelengths) on which all the other routers can
write if they hold the token. Similar to the performance evaluation
Figure 4.5: Area comparison between the proposed and state-of-the-art topologies.
of OWN, with R-OWN, I have used a cycle accurate simulator [50] to capture the network
performance.
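The token-based MWSR arbitration in Opt-Xbar (and in OWN's shared optical channels) can be pictured as a token circulating among the writers of a channel, with only the current holder allowed to transmit. The sketch below is a simplified round-robin illustration of that idea; the scheduling details are hypothetical, not the thesis's exact arbiter.

```python
# Minimal sketch of token-based MWSR arbitration: a single token
# circulates among the writers sharing one reader's channel, and only
# the token holder may send. Illustrative only, not the exact design.

from collections import deque

def mwsr_schedule(pending, n_writers, rounds):
    """Rotate a token among n_writers for `rounds` cycles; the holder
    sends one queued packet. `pending` maps writer id -> deque of
    packets. Returns the (writer, packet) send order."""
    sent = []
    token = 0
    for _ in range(rounds):
        if pending.get(token):
            sent.append((token, pending[token].popleft()))
        token = (token + 1) % n_writers  # pass the token on
    return sent

pending = {0: deque(["p0a", "p0b"]), 2: deque(["p2a"])}
print(mwsr_schedule(pending, n_writers=4, rounds=6))
# -> [(0, 'p0a'), (2, 'p2a'), (0, 'p0b')]
```

The wait for the token grows with the number of writers sharing a channel, which is why the large shared crossbar of Opt-Xbar saturates earlier than the smaller split crossbars of OWN and R-OWN.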
4.2.1 Area Estimation
The area comparison of R-OWN, OWN, CMesh, WCube, and Opt-Xbar is shown in
Figure 4.5. As can be seen, Opt-Xbar requires the highest area, which is 27% higher than
that of OWN, whereas WCube, CMesh, and R-OWN require 27%, 17%, and 13% more area
respectively when compared to OWN. OWN, R-OWN, CMesh, and Opt-Xbar all have 64
routers with a core concentration of 4. Since OWN has a lower number of input ports and
the crossbar of the optical router is split into two (shown in Figure 2.1), OWN requires
less router area. This can be verified by the fact that Opt-Xbar requires less router area than
CMesh because Opt-Xbar has a large number of output ports with fewer input ports. Since I
extend OWN to R-OWN by implementing adaptive wireless transceivers, R-OWN requires
a higher number of wireless transceivers than OWN. As a result, R-OWN requires a higher
wireless link area compared to OWN. R-OWN also requires a slightly higher router area
than OWN due to the increase in the radix of the wireless router (the optical router remains
the same). In this analysis, I have ignored the counter and local arbiter area as they are very
small. Since OWN and thus R-OWN contain several smaller crossbars, OWN and R-OWN
require less photonic link area than Opt-Xbar because Opt-Xbar contains one large
crossbar.
4.2.2 Energy Estimate
Figure 4.6 shows the energy per bit comparison for UN, BR, MT, and PS traffic patterns
with the geometric mean. WCube has fewer wireless channels than OWN and R-OWN.
Hence, the number of wireless link traversals, and thus the wireless link energy consumption,
is lower for WCube than for OWN and R-OWN. Because R-OWN uses more wireless
channels, it consumes more wireless link energy than OWN. The difference is visible
for MT and PS traffic as for these two traffic patterns, adaptive wireless links are well
utilized which is also reflected in their saturation throughput (Figure 4.7). Since photonic
link energy consumption is much lower than the other technologies, it does not affect the
overall energy consumption significantly. Nevertheless, OWN and R-OWN both consume
an order of magnitude lower energy than Opt-Xbar due to a smaller crossbar size. Opt-
Xbar consumes the lowest router energy because it has fewer input ports and more output
ports. The fewer input ports contribute to lower buffer energy, while the larger number of
output ports lowers the crossbar energy per flit. OWN and R-OWN
both consume lower router energy than CMesh and WCube. This is due to the lower
hop count, fewer input ports combined with more output ports, and the splitting of the
crossbar. However, R-OWN requires higher router energy compared
to OWN due to the increase of wireless router radix. Compared to OWN and R-OWN,
Figure 4.6: Energy per bit comparison between different topologies for various types of traffic. This energy calculation includes both leakage and dynamic components.
WCube consumes lower wireless link energy and higher wired link and router energy. This
makes WCube the highest energy consuming architecture. The end result is that OWN
consumes 73% higher energy per bit than Opt-Xbar and 7%, 54%, and 62% less energy per
bit than R-OWN, CMesh, and WCube respectively.
4.2.3 Saturation Throughput and Latency Comparison
In this section, I discuss the latency and saturation throughput of OWN and R-OWN
compared to CMesh, WCube, and Opt-Xbar. Figure 4.7 shows the latency, in cycles, for
the traffic types UN, BR, MT, and NBR under varied
network load. For the UN, BR, and MT traffic patterns shown in Figure 4.7 (a, b, and
c respectively), OWN and R-OWN both perform better than other architectures with R-
OWN being the best. This is because both OWN and R-OWN require a maximum of three
Figure 4.7: Latency comparison between different networks for (a) uniform traffic, (b) bit-reversal traffic, (c) matrix transpose traffic, and (d) neighbor traffic.
hops to transmit to any part of the network. Opt-Xbar requires less time when the network
load is low, but it saturates earlier than WCube for uniform traffic. This is because, with
the increase of the network load, the wait time for a token in Opt-Xbar increases. This is
also true for OWN and R-OWN. However, in OWN and R-OWN, fewer routers share the
crossbar. Hence, the delay increase is small. This can also be verified by observing
that the zero load latency for Opt-Xbar is higher than for OWN and R-OWN. For a low
network load, OWN and R-OWN both have similar latency because the contention in the
network is low and the improvement due to the reconfiguration is small. Nevertheless,
as the load increases, R-OWN performs better than OWN because R-OWN allocates the
adaptive wireless channels efficiently to the routers that are experiencing more traffic. For
the neighbor traffic pattern, Opt-Xbar shows the worst performance as illustrated in Figure
Figure 4.8: The saturation throughput of the compared architectures with the geometric mean (GM).
4.7 (d). In the case of the neighbor traffic pattern, the source and destination cores are
close to each other, and the requirement of a token for every communication in Opt-Xbar
increases the delay. CMesh and WCube both perform better than Opt-Xbar since they do
not have such a delay. They also perform similarly because the wireless links in WCube
are underutilized. As wireless link utilization is low, OWN and R-OWN both perform
similarly. Nonetheless, they perform better than CMesh and WCube due to a lower hop
requirement.
Figure 4.8 shows the saturation throughput for traffic types UN, BR, MT, PS, and
NBR where GM is the geometric mean. Because OWN and R-OWN both have the lowest
diameter, they have the highest saturation throughput for UN and MT. In the case of BR,
high inter-cluster communication creates contention at the wireless links, and thus OWN
has less throughput than Opt-Xbar. However, since R-OWN adapts with the network load
pattern, R-OWN has the highest throughput. For PS, the utilization of wireless links is
diverse. This causes the saturation throughput of OWN to fall since certain wireless links
are overutilized while others are underutilized. Hence, for PS, the improvement of R-
OWN with respect to OWN is the highest. As a result, R-OWN has 15% higher saturation
throughput than OWN and OWN has 8%, 16%, and 21% higher saturation throughput than
Opt-Xbar, WCube, and CMesh respectively.
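The GM bars in Figures 4.4 and 4.8 summarize the per-pattern saturation throughputs with the geometric mean, the conventional way to average normalized performance figures. A short sketch follows; the throughput values are illustrative placeholders, not the measured results.

```python
# Geometric mean used for the GM bars: the n-th root of the product,
# computed in log space for numerical stability. Throughput values
# below are illustrative placeholders, not measured results.

import math

def geometric_mean(values):
    """Return exp(mean(log(v))) == (v1 * v2 * ... * vn) ** (1/n)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical saturation throughputs (flits/cycle/core) for
# UN, BR, MT, PS, and NBR traffic.
throughput = [0.40, 0.35, 0.38, 0.30, 0.50]
print(geometric_mean(throughput))
```

Unlike the arithmetic mean, the geometric mean is not skewed by a single traffic pattern with an unusually high throughput, which makes it the fairer single-number summary across patterns.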
4.3 Performance Evaluation of On-Chip and Off-Chip Wireless Network
The proposed on-chip and off-chip wireless architectures are compared against the
baseline architecture (Table 3.2) to evaluate their performance. I have used the cycle-accurate
simulator Multi2Sim [52] to simulate the network performance of the proposed
architectures on the PARSEC 2.1 benchmark suite [53]. The simulation parameters used are shown
in Table 4.1.
4.3.1 Execution Time Estimate
Figure 4.9 shows the execution times of the blackscholes benchmark for all the
architectures. It can be seen that the proposed architectures, except M-W-O-A and W-W-D-C,
require lower execution times than the baseline architecture M-M-X-X. This is due to the
fact that, for off-chip memory accesses, the proposed architectures require a lower num-
ber of hops than the baseline architecture. Therefore, the hybrid-wireless architectures
that have the highest bandwidth perform the best. Because the off-chip link bandwidth of
W-W-D-C is orders of magnitude lower than the baseline, the improvement achieved by
the hop-count reduction is nullified, and W-W-D-C performs the worst. In the case of M-
W-O-A, there is no reduction in the hop-count for off-chip memory accesses. Moreover,
the off-chip wireless link bandwidth in M-W-O-A is half of the metallic link bandwidth
in M-M-X-X but is higher than the off-chip wireless link bandwidth in W-W-D-C. Hence,
Table 4.1: Simulation parameters [2].
Core Frequency [54]: 2 GHz
Threads per core [54, 55]: 4
Cache line [54–57]: 64 Byte
Page Size [58]: 4 KB
L1-I (private) [55, 56]: 32 KB, 4-way, LRU
L1-D (private) [56]: 32 KB, 4-way, LRU
L1 Cache Latency [54, 56, 59]: 2 cycles
L2 (shared) [55]: 256 KB/core, 8-way, LRU
L2 Cache Latency [59]: 20 cycles
MSHR [55]: 16
Memory Frequency: 1 GHz
Address Mapping [54]: Interleaving
Memory Latency [52]: 200 cycles
Memory Controllers [57]: 4
Channel Width: 16 GBps [56, 57], 8 GBps [54, 56]
On-chip Metallic Interconnect Bandwidth: 8 GBps
Trace Length [41]: 2 in
VCs per port: 4
baseline M-M-X-X performs better than M-W-O-A, and M-W-O-A performs better than
W-W-D-C.
4.3.2 Energy per Byte Estimate
The energy per byte requirement for the on-chip components of the proposed and
baseline architectures is shown in Figure 4.10 (a). It can be seen that the architectures that
have metallic on-chip links (M-X-X-X) are more energy efficient than the architectures that
have wireless on-chip links (W-X-X-X). This is because the energy per bit requirement of
a wireless link is higher than that of a metallic link over short distances, and the hop-count
savings are not large enough to overcome this difference. As a result, a 5.6% reduction in
Figure 4.9: Execution time of the PARSEC 2.1 benchmark blackscholes for the compared architectures [2].
energy efficiency is observed. However, I argue that as the number of cores on a single
chip increases, this gap would narrow because of the increase in network traffic.
The energy per byte requirement for the off-chip components is shown in Figure 4.10
(b). An improvement of 87% in energy efficiency is achieved when a wireless link is used
for an off-chip communication instead of a metallic link. This is because, unlike a metallic
link, the energy per bit requirement of a wireless link does not increase quadratically with
distance. Moreover, an off-chip metallic link traversal requires more clock cycles than an
off-chip wireless link traversal, which takes only one clock cycle. By adding both the on-
chip and off-chip elements, I get the overall energy efficiency which is shown in Figure
4.10 (c). The overall improvement in energy efficiency is about 79% which is due to the
energy savings in the off-chip link traversals.
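The distance argument above can be made concrete with a toy model: metallic link energy per bit grows roughly quadratically with distance, while the wireless transceiver cost is roughly flat over these ranges. The coefficients below are illustrative placeholders, not measured values from this thesis.

```python
# Toy model of the off-chip link energy argument: metallic link energy
# per bit grows roughly quadratically with distance, while wireless
# link energy is approximately distance-independent at these ranges.
# The coefficients are illustrative placeholders, NOT measured values.

def metallic_energy_pj(distance_cm, k=0.5):
    """Quadratic growth with distance (k in pJ/bit/cm^2, assumed)."""
    return k * distance_cm ** 2

def wireless_energy_pj(distance_cm, fixed=4.0):
    """Approximately fixed transceiver cost per bit (pJ, assumed)."""
    return fixed

for d in (1, 2, 4, 8):
    m, w = metallic_energy_pj(d), wireless_energy_pj(d)
    print(f"{d} cm: metallic {m:.1f} pJ/bit, wireless {w:.1f} pJ/bit")
```

Under these placeholder coefficients, the wire wins at short distances and the wireless link wins beyond a crossover point, which is the qualitative behavior behind the 87% off-chip energy improvement reported above.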
Figure 4.10: Energy per byte comparison for the baseline and the proposed architectures. (a) Energy per byte for the on-chip elements such as router, memory controller, and link. (b) Energy per byte for the off-chip element, i.e., the link connecting the memory controller and the DRAM. (c) Energy per byte for both the on-chip and off-chip elements.
5 Conclusions
In this thesis, I proposed two on-chip networks: Optical and Wireless Network-on-
Chip (OWN) and Reconfigurable Optical and Wireless Network-on-Chip (R-OWN). Both
networks employ optics and wireless technology to facilitate on-chip core-to-core
communication. My simulation results show that OWN requires 34% more area than
hybrid-wireless architecture WCube and 35% less area than hybrid-optical architecture
ATAC [1]. OWN also consumes 30% less energy per bit than WCube and 14% more energy
per bit than ATAC [1]. Moreover, OWN shows 8% and 28% improvement in saturation
throughput compared to the WCube and CMesh architectures respectively [1]. Although
OWN already improves on other state-of-the-art NoC architectures, I extended OWN to
R-OWN by making the wireless channels reconfigurable. The end result is that
R-OWN consumes 44% and 50% less energy per bit compared to CMesh and WCube
respectively. R-OWN also has saturation throughput that is 27% and 31% higher than
WCube and CMesh respectively. In addition, R-OWN requires 3.9% and 12% less area
compared to CMesh and WCube respectively.
I also proposed the use of wireless technology for off-chip memory access in this thesis.
My proposed on-chip and off-chip wireless network (W-W-D-A) shows significant energy
and latency improvement. W-W-D-A requires 11% less execution time compared to the
wired baseline architecture. W-W-D-A also consumes approximately 79% less energy per
packet compared to the baseline architecture. However, the proposed network may incur
an area overhead.
References
[1] M. A. I. Sikder, A. K. Kodi, M. Kennedy, S. Kaya, and A. Louri, “Own: Optical and wireless network-on-chip for kilo-core architectures,” in High-Performance Interconnects (HOTI), 2015 IEEE 23rd Annual Symposium on. IEEE, 2015, pp. 44–51.
[2] M. A. I. Sikder, D. DiTomaso, A. K. Kodi, W. Rayess, D. Matolak, and S. Kaya, “Exploring wireless technology for off-chip memory access,” in High-Performance Interconnects (HOTI), 2016 IEEE 24th Annual Symposium on. IEEE, 2016.
[3] S.-B. Lee, S.-W. Tam, I. Pefkianakis, S. Lu, M. F. Chang, C. Guo, G. Reinman, C. Peng, M. Naik, L. Zhang et al., “A scalable micro wireless interconnect structure for cmps,” in Proceedings of the 15th annual international conference on Mobile computing and networking. ACM, 2009, pp. 217–228.
[4] A. Ganguly, K. Chang, S. Deb, P. P. Pande, B. Belzer, and C. Teuscher, “Scalable hybrid wireless network-on-chip architectures for multicore systems,” Computers, IEEE Transactions on, vol. 60, no. 10, pp. 1485–1502, 2011.
[5] D. DiTomaso, A. Kodi, S. Kaya, and D. Matolak, “iwise: Inter-router wireless scalable express channels for network-on-chips (nocs) architecture,” in High Performance Interconnects (HOTI), 2011 IEEE 19th Annual Symposium on. IEEE, 2011, pp. 11–18.
[6] Y. Pan, P. Kumar, J. Kim, G. Memik, Y. Zhang, and A. Choudhary, “Firefly: illuminating future network-on-chip with nanophotonics,” in ACM SIGARCH Computer Architecture News, vol. 37, no. 3. ACM, 2009, pp. 429–440.
[7] G. Kurian, J. E. Miller, J. Psota, J. Eastep, J. Liu, J. Michel, L. C. Kimerling, and A. Agarwal, “Atac: a 1000-core cache-coherent processor with on-chip optical network,” in Proceedings of the 19th international conference on Parallel architectures and compilation techniques. ACM, 2010, pp. 477–488.
[8] M. A. I. Sikder, A. K. Kodi, and A. Louri, “Reconfigurable optical and wireless (r-own) network-on-chip for high performance computing,” in Proceedings of the Third Annual International Conference on Nanoscale Computing and Communication, ser. NANOCOM’ 16. ACM, 2016.
[9] J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach.Elsevier, 2011.
[10] K. Olukotun, L. Hammond, and J. Laudon, “Chip multiprocessor architecture:techniques to improve throughput and latency,” Synthesis Lectures on ComputerArchitecture, vol. 2, no. 1, pp. 1–145, 2007.
[11] W. J. Dally and B. P. Towles, Principles and practices of interconnection networks.Elsevier, 2004.
[12] “Intel Xeon Processor E5-4669 v3 (45M Cache, 2.10GHz),” 2015. [Online]. Available: http://ark.intel.com/products/85766/
Intel-Xeon-Processor-E5-4669-v3-45M-Cache-2 10-GHz
[13] “Intel Xeon Processor E5-2603 v2 (10M Cache, 1.80GHz),” 2013. [Online]. Available: http://ark.intel.com/products/76157/
Intel-Xeon-Processor-E5-2603-v2-10M-Cache-1 80-GHz
[14] “Intel Itanium Processor 9560 (32M Cache, 2.53 GHz),”2012. [Online]. Available: http://ark.intel.com/products/71699/
Intel-Itanium-Processor-9560-32M-Cache-2 53-GHz
[15] “Intel Xeon phiT M Coprocessor 5110P (8GB, 1.053 GHz, 60Core),” 2012. [Online]. Available: http://ark.intel.com/products/71992/
Intel-Xeon-Phi-Coprocessor-5110P-8GB-1 053-GHz-60-core
[16] “Intel CoreT M2 Duo Processor E7500 (3M Cache, 2.93 GHz, 1066MHz FSB),” 2009. [Online]. Available: http://ark.intel.com/products/36503/
Intel-Core2-Duo-Processor-E7500-3M-Cache-2 93-GHz-1066-MHz-FSB
[17] “Intel Xeon Processor E5520 (8M Cache, 2.26 GHz, 5.86 GT/sIntel QPI),” 2009. [Online]. Available: http://ark.intel.com/products/40200/
Intel-Xeon-Processor-E5520-8M-Cache-2 26-GHz-5 86-GTs-Intel-QPI
[18] “Intel AtomT M Processor N270 (512K Cache, 1.60 GHz, 533MHz FSB),” 2008. [Online]. Available: http://ark.intel.com/products/36331/
Intel-Atom-Processor-N270-512K-Cache-1 60-GHz-533-MHz-FSB
[19] “Intel CoreT M i7-920 Processor (8M Cache, 2.66 GHz, 4.80 GT/sIntel QPI),” 2008. [Online]. Available: http://ark.intel.com/products/37147/
Intel-Core-i7-920-Processor-8M-Cache-2 66-GHz-4 80-GTs-Intel-QPI
[20] “Intel Pentium D Processor 805 (2M Cache, 2.66 GHz, 533MHz FSB),” 2005. [Online]. Available: http://ark.intel.com/products/27511/
Intel-Pentium-D-Processor-805-2M-Cache-2 66-GHz-533-MHz-FSB
[21] “Intel Pentium 4 Processor 2.80 GHz, 512K Cache, 533 MHzFSB,” 2002. [Online]. Available: http://ark.intel.com/products/27447/
Intel-Pentium-4-Processor-2 80-GHz-512K-Cache-533-MHz-FSB
[22] “Intel Pentium III Processor 1.00 GHz, 256K Cache, 133 MHzFSB,” 2000. [Online]. Available: http://ark.intel.com/products/27529/
Intel-Pentium-III-Processor-1 00-GHz-256K-Cache-133-MHz-FSB
[23] “Intel Pentium Pro Processor 200 MHz, 512K Cache, 66 MHzFSB,” 1995. [Online]. Available: http://ark.intel.com/products/49953/
Intel-Pentium-Pro-Processor-200-MHz-512K-Cache-66-MHz-FSB
[24] “Intel Pentium II Processor,” 1998. [Online]. Available: http://www.intel.com/design/
pentiumii/prodbref/#performance
[25] “SPARC M7-8 Server,” 2015. [Online]. Available: http://www.oracle.com/us/products/servers-storage/sparc-m7-8-servers-ds-2695738.pdf
[26] “AMD OpteronT M 6300 Series Processors,” 2014. [Online]. Available:http://www.amd.com/en-us/products/server/opteron/6000/6300#
[27] “AMD-K5T M Processor,” 1997. [Online]. Available: http://datasheets.chipdb.org/
upload/Unzlbunzl/AMD/18522F%20AMD-K5.pdf
[28] J. Balfour and W. J. Dally, “Design tradeoffs for tiled cmp on-chip networks,” inProceedings of the 20th annual international conference on Supercomputing. ACM,2006, pp. 187–198.
[29] J. Held, “Single-chip cloud computer: an IA tera-scale research processor,” in Euro-Par 2010 Parallel Processing Workshops, M. R. Guarracino et al., Eds., 2010, p. 85.
[30] T. G. Mattson, R. Van der Wijngaart, and M. Frumkin, “Programming the intel 80-core network-on-a-chip terascale processor,” in Proceedings of the 2008 ACM/IEEEconference on Supercomputing. IEEE Press, 2008, p. 38.
[31] A. Jantsch and H. Tenhunen, “Network on chip,” in Proceedings of the ConferenceRadio vetenskap och Kommunication, Stockholm, 2002.
[32] B. D. de Dinechin, P. G. de Massas, G. Lager, C. Leger, B. Orgogozo, J. Reybert, andT. Strudel, “A distributed run-time environment for the kalray mppa®-256 integratedmanycore processor,” Procedia Computer Science, vol. 18, pp. 1654–1663, 2013.
[33] L. Chen and T. M. Pinkston, “Nord: Node-router decoupling for effective power-gating of on-chip routers,” in Proceedings of the 2012 45th Annual IEEE/ACMInternational Symposium on Microarchitecture. IEEE Computer Society, 2012, pp.270–281.
[34] R. Das, S. Narayanasamy, S. K. Satpathy, and R. G. Dreslinski, “Catnap: energyproportional multiple network-on-chip,” in ACM SIGARCH Computer ArchitectureNews, vol. 41, no. 3. ACM, 2013, pp. 320–331.
[35] J. Murray, P. P. Pande, and B. Shirazi, “Dvfs-enabled sustainable wireless nocarchitecture,” in SOC Conference (SOCC), 2012 IEEE International. IEEE, 2012,pp. 301–306.
[36] K. Chang, S. Deb, A. Ganguly, X. Yu, S. P. Sah, P. P. Pande, B. Belzer, andD. Heo, “Performance evaluation and design trade-offs for wireless network-on-chip architectures,” ACM Journal on Emerging Technologies in Computing Systems(JETC), vol. 8, no. 3, p. 23, 2012.
[37] D. DiTomaso, A. Kodi, D. Matolak, S. Kaya, S. Laha, and W. Rayess, “A-winoc:Adaptive wireless network-on-chip architecture for chip multiprocessors,” Paralleland Distributed Systems, IEEE Transactions on, vol. 26, no. 12, pp. 3289–3302, 2015.
[38] A. K. Kodi, M. A. I. Sikder, D. DiTomaso, S. Kaya, S. Laha, D. Matolak,and W. Rayess, “Kilo-core wireless network-on-chips (nocs) architectures,” inProceedings of the Second Annual International Conference on Nanoscale Computingand Communication, ser. NANOCOM’ 15. ACM, 2015, pp. 33:1–33:6.
[39] D. Vantrease, R. Schreiber, M. Monchiero, M. McLaren, N. P. Jouppi, M. Fiorentino, A. Davis, N. Binkert, R. G. Beausoleil, and J. H. Ahn, “Corona: System implications of emerging nanophotonic technology,” in ACM SIGARCH Computer Architecture News, vol. 36, no. 3. IEEE Computer Society, 2008, pp. 153–164.
[40] A. Joshi, C. Batten, Y.-J. Kwon, S. Beamer, I. Shamim, K. Asanovic, andV. Stojanovic, “Silicon-photonic clos networks for global on-chip communication,”in Proceedings of the 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip. IEEE Computer Society, 2009, pp. 124–133.
[41] I. Micron Technology, “Tn-41-13: Ddr3 point-to-point design support,” 2013.
[42] H. M. Cheema and A. Shamim, “The last barrier,” IEEE Microwave Magazine,vol. 14, no. 1, pp. 79–91, 2013.
[43] A. Balteanu, S. Shopov, and S. P. Voinigescu, “A 2× 44gb/s 110-ghz wirelesstransmitter with direct amplitude and phase modulation in 45-nm soi cmos,” inCompound Semiconductor Integrated Circuit Symposium (CSICS), 2013 IEEE.IEEE, 2013, pp. 1–4.
[44] K. Nakajima, A. Maruyama, T. Murakami, M. Kohtani, T. Sugiura, E. Otobe,J. Lee, S. Cho, K. Kwak, J. Lee et al., “A low-power 71ghz-band cmos transceivermodule with on-board antenna for multi-gbps wireless interconnect,” in MicrowaveConference Proceedings (APMC), 2013 Asia-Pacific. IEEE, 2013, pp. 357–359.
[45] J. A. Z. Luna, A. Siligaris, C. Pujol, and L. Dussopt, “A packaged 60 ghz low-powertransceiver with integrated antennas for short-range communications.” in RWS, 2013,pp. 355–357.
[46] Q. Xu, S. Manipatruni, B. Schmidt, J. Shakya, and M. Lipson, “12.5 gbit/s carrier-injection-based silicon micro-ring silicon modulators,” Optics express, vol. 15, no. 2,pp. 430–436, 2007.
[47] D. Abts, N. D. Enright Jerger, J. Kim, D. Gibson, and M. H. Lipasti, “Achievingpredictable performance through better memory controller placement in many-corecmps,” in ACM SIGARCH Computer Architecture News, vol. 37, no. 3. ACM, 2009,pp. 451–461.
[48] A. Sharifi, E. Kultursay, M. Kandemir, and C. Das, “Addressing end-to-end memoryaccess latency in noc-based multicores,” in Microarchitecture (MICRO), 2012 45thAnnual IEEE/ACM International Symposium on, Dec 2012, pp. 294–304.
[49] C. Sun, C.-H. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L.-S. Peh, and V. Stojanovic, “Dsent - a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling,” in Networks on Chip (NoCS), 2012 Sixth IEEE/ACM International Symposium on. IEEE, 2012, pp. 201–210.
[50] A. Kodi and A. Louri, “A system simulation methodology of optical interconnects forhigh-performance computing systems,” J. Opt. Netw, vol. 6, no. 12, pp. 1282–1300,2007.
[51] J. Kim, W. J. Dally, B. Towles, and A. K. Gupta, “Microarchitecture of a high-radixrouter,” in ACM SIGARCH Computer Architecture News, vol. 33, no. 2. IEEEComputer Society, 2005, pp. 420–431.
[52] R. Ubal, J. Sahuquillo, S. Petit, P. Lopez, Z. Chen, and D. R. Kaeli, “The multi2sim simulation framework: A cpu-gpu model for heterogeneous computing,” 2011.
[53] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The parsec benchmark suite: Characterization and architectural implications,” in Proceedings of the 17th international conference on Parallel architectures and compilation techniques. ACM, 2008, pp. 72–81.
[54] I. Bhati, Z. Chishti, S.-L. Lu, and B. Jacob, “Flexible auto-refresh: Enabling scalableand energy-efficient dram refresh reductions,” in Computer Architecture (ISCA), 2015ACM/IEEE 42nd Annual International Symposium on, June 2015, pp. 235–246.
[55] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “Pim-enabled instructions: A low-overhead,locality-aware processing-in-memory architecture,” in Computer Architecture (ISCA),2015 ACM/IEEE 42nd Annual International Symposium on, June 2015, pp. 336–348.
[56] Y. Lee, J. Kim, H. Jang, H. Yang, J. Kim, J. Jeong, and J. Lee, “A fully associative,tagless dram cache,” in Computer Architecture (ISCA), 2015 ACM/IEEE 42nd AnnualInternational Symposium on, June 2015, pp. 211–222.
[57] O. Seongil, Y. H. Son, N. S. Kim, and J. H. Ahn, “Row-buffer decoupling: Acase for low-latency dram microarchitecture,” in Computer Architecture (ISCA), 2014ACM/IEEE 41st International Symposium on, June 2014, pp. 337–348.
[58] A. Ros and S. Kaxiras, “Callback: Efficient synchronization without invalidation witha directory just for spin-waiting,” in Computer Architecture (ISCA), 2015 ACM/IEEE42nd Annual International Symposium on, June 2015, pp. 427–438.
[59] L. Peled, S. Mannor, U. Weiser, and Y. Etsion, “Semantic locality and context-basedprefetching using reinforcement learning,” in Computer Architecture (ISCA), 2015ACM/IEEE 42nd Annual International Symposium on, June 2015, pp. 285–297.