
Department of Information Technology

CT30A7001 – Concurrent and parallel computing

Document for seminar presentation

Comparison of interconnection networks

Otso Lonka Student ID: 0279351 [email protected] Andrey Naralchuk Student ID: B0331697 [email protected]

Lappeenranta, 2008


TABLE OF CONTENTS

1. Introduction
  1.1. Background knowledge
  1.2. Goals and limitations
  1.3. The structure of the work
2. Interconnect network architectures
  2.1. Main concept
  2.2. Different interconnection networks in general
3. Comparison of interconnection networks
  3.1. In scientific publications before 1994
  3.2. In scientific publications from 1994 to the present
4. TOP500.org statistics
5. Summary
References


1. Introduction

1.1. Background knowledge

This document was prepared for a seminar presentation in the course Concurrent and Parallel Computing. The topic of the presentation is the comparison of interconnection networks. Our task is to compare interconnection network solutions, and the comparison is based on the scientific publications that were available.

1.2. Goals and limitations

The key goal of this work is to gather information about interconnection networks and to report the results of the comparison. We give an overview of interconnection networks and clarify which features of an interconnection network are needed for supercomputing. We also introduce some interconnection networks as examples. There are many interconnection networks available, and this paper does not consider all of them.

1.3. The structure of the work

The second chapter considers the architectures of interconnection networks at a general level. The third chapter contains the comparison of interconnection networks based on scientific articles and also explains why it is difficult to compare different interconnection networks. The fourth chapter introduces statistics and charts from top500.org and explains them. In the last chapter we summarize what we did and what we discovered about interconnection networks.

2. Interconnect network architectures

2.1. Main concept

A parallel computer consists of processing units, memory banks and an interconnection network. The processing units are responsible for data processing, and the interconnection network is responsible for data transfer between the processing units and the memory banks. [1]

The interconnection network provides connections between the processing nodes in a parallel processing system. Interconnection networks can be classified as static or dynamic (Picture 1). A static network is a point-to-point network between processing nodes: a processing node is connected to another processing node without the use of any external switching elements. A static interconnection network is also referred to as a direct interconnection network. Dynamic interconnection networks are built by using switches and cables between processing elements; the network switches and connections form the interconnection network, and the processing units are separate from it. Dynamic networks are also called indirect networks. [1] Dynamic interconnection networks are in general more scalable.

Picture 1. Static and Indirect networks [1].

The links of an interconnection network can be based either on conducting material or on fiber. [1] For links built of conducting material there is always some amount of signal loss and noise. The capacitance between the signal path and ground limits the frequency range that can be used for data transfer: the cable itself works as a low-pass filter that attenuates high frequencies, and the longer the cable is, the more capacitance it contains. [1] Because high frequencies are needed to achieve high data transfer rates, the length of the link limits the data transfer capacity when links based on conducting material are used. Fiber cables have no such limitation, since the signal is transferred in a non-electrical form. With fiber cables the lengths of the links are not as critical, and this is probably one of the reasons why clustering is so popular in modern supercomputers instead of building everything on PCBs.
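As a rough illustration of this low-pass behaviour, the following sketch models a copper link as a simple first-order RC filter. The per-metre resistance and capacitance values are illustrative assumptions, not measured cable parameters; the point is only that the cutoff frequency falls rapidly as the link gets longer.

    import math

    # Illustrative, assumed cable parameters (not taken from [1]):
    R_PER_M = 1.0      # series resistance per metre, ohms
    C_PER_M = 100e-12  # capacitance to ground per metre, farads

    def cutoff_frequency_hz(length_m):
        """First-order RC low-pass cutoff f_c = 1 / (2*pi*R*C) for a cable of the given length."""
        r = R_PER_M * length_m
        c = C_PER_M * length_m
        return 1.0 / (2.0 * math.pi * r * c)

    # Both R and C grow with length, so doubling the cable cuts the usable bandwidth to a quarter.
    for length_m in (1, 2, 10):
        print(f"{length_m:>3} m cable: cutoff ~ {cutoff_frequency_hz(length_m) / 1e6:.0f} MHz")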

2.2. Different interconnection networks in general

A multistage crossbar is an interconnection network in which all input ports are directly connected to all

output ports without interference from messages from other ports. In a one-stage crossbar this has the

effect that for instance all memory modules in a computer system are directly coupled to all CPUs. This

is often the case in multi-CPU vector systems. In multistage crossbar networks the output ports of one

crossbar module are coupled with the input ports of other crossbar modules. In this way one is able to

build networks that grow with logarithmic complexity, thus reducing the cost of a large network [2].
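To make the cost argument concrete, the sketch below counts the k x k switch modules in an N-port multistage network built from log_k N stages of N/k modules each, and compares the crosspoint count with a single N x N crossbar. This is an illustration of the logarithmic-growth argument, not a construction taken from [2].

    def multistage_switch_modules(n_ports, k):
        """Number of k-by-k crossbar modules in a multistage network with log_k(N) stages of N/k modules."""
        stages, capacity = 0, 1
        while capacity < n_ports:   # integer computation of ceil(log_k(N))
            capacity *= k
            stages += 1
        return stages * (n_ports // k)

    for n in (64, 256, 1024):
        modules = multistage_switch_modules(n, k=8)
        print(f"N={n:5d}: {modules:4d} 8x8 modules ({modules * 64:7d} crosspoints) "
              f"vs {n * n:8d} crosspoints in a single crossbar")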

The hypercube, or binary n-cube, is a family of direct or indirect networks that is a special case of the family of k-ary n-cubes with n dimensions and k nodes in each dimension, with k equal to two; n-cube is used to denote the binary n-cube. The n-dimensional hypercube is composed of 2^n nodes and has n edges per node. If unique n-bit binary addresses are assigned to the nodes of the hypercube, then an edge connects two nodes if and only if their binary addresses differ in a single bit [3].
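The single-bit rule makes neighbour computation trivial: XOR-ing the node address with each of the n bit masks enumerates all n neighbours. A minimal sketch:

    def hypercube_neighbors(node, n):
        """Neighbours of `node` in an n-dimensional binary hypercube.

        Two nodes are adjacent iff their n-bit addresses differ in exactly one bit,
        so flipping each bit position with XOR yields every neighbour.
        """
        return [node ^ (1 << bit) for bit in range(n)]

    # Node 5 (binary 101) in a 3-cube has neighbours 4 (100), 7 (111) and 1 (001).
    print(hypercube_neighbors(5, 3))   # [4, 7, 1]

    # The whole 3-cube: 2**3 = 8 nodes, 3 edges per node, 8 * 3 / 2 = 12 edges in total.
    edges = {(min(u, v), max(u, v)) for u in range(2 ** 3) for v in hypercube_neighbors(u, 3)}
    print(len(edges))                  # 12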

A mesh is the family of interconnection networks that are a special case of k-ary n-cube networks in

which the number of dimensions, n, is two.

Until recently Myrinet was the market leader in fast cluster networks, and it is still one of the largest. The first Myrinet implementation was positioned as an alternative to Ethernet for connecting the nodes in a cluster. Myrinet

uses cut-through routing for an efficient utilization of the network. Also RDMA is used to write to/read

from the remote memory of other host adaptor cards, called Lanai cards. These cards interface with the

PCI-X bus of the host they are attached to. The latest Myrinet implementation only uses fibers as signal

carriers.

Since the start of 2006 Myricom has provided a multi-protocol switch (and adapters): the Myri-10G. Apart from Myricom's own MX protocol it also supports 10 Gigabit Ethernet, which makes it easy to connect to external nodes/clusters. The specifications given by Myricom are quite good: ≅ 1.2 GB/s uni-directional theoretical bandwidth for the MX protocol, and about the same for the MX emulation of TCP/IP over 10 Gigabit Ethernet. According to Myricom there is no difference between MX-level and MPI-level bandwidth, and the same holds for the latencies [4].

Infiniband has rapidly become a widely accepted medium for internode networks. The specification was finished in June 2001. Infiniband is employed to connect various system components within a system. Via Host Channel Adapters (HCAs) the Infiniband fabric can be used for interprocessor networks, for attaching I/O subsystems, or to connect to multi-protocol switches such as Gbit Ethernet switches. Because of this versatility the market is not limited to the interprocessor network segment alone, so Infiniband is expected to become relatively inexpensive as higher sales volumes are realised. The characteristics of Infiniband are rather attractive.

Conceptually, Infiniband defines two types of connectors to the system components: the Host

Channel Adapters (HCAs), already mentioned, and Target Channel Adapters (TCAs). Infiniband

defines a basic link speed of 2.5 Gb/s (312.5 MB/s) but also a 4× and 12× speed of 1.25 GB/s and 3.75

GB/s, respectively. Also HCAs and TCAs can have multiple ports that are independent and allow for

higher reliability and speed [5].
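The 4x and 12x figures follow directly from the basic 2.5 Gb/s lane rate; a small sketch of the arithmetic (ignoring encoding overhead, as the figures above do):

    BASE_LANE_GBPS = 2.5  # basic Infiniband link speed quoted above, in Gb/s

    def link_bandwidth_mb_s(lane_count):
        """Raw one-directional bandwidth of an Infiniband link with the given lane count, in MB/s."""
        return BASE_LANE_GBPS * lane_count * 1000 / 8   # Gb/s -> Mb/s -> MB/s

    for lanes in (1, 4, 12):
        print(f"{lanes:2d}x link: {link_bandwidth_mb_s(lanes):7.1f} MB/s")
    # 1x -> 312.5 MB/s, 4x -> 1250 MB/s (1.25 GB/s), 12x -> 3750 MB/s (3.75 GB/s)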

QsNet(Quadrics) is the main product of Quadrics, a company that initially started in the 1990s. The

network has effectively two parts: the ELAN interface cards, comparable to Infiniband Host Bus

Adaptors or Myrinet's Lanai interface cards, and the Elite switch, comparable to an Infiniband

switch/router or a Myrinet switch. The topology that is used is a (quaternary) fat tree. The ELAN card

interfaces with the PCI-X port of a host computer.

The second generation is QsNet II. The structure, protocols, etc. are very similar to those of the former QsNet, but it is much faster, with a link speed of 1.3 GB/s. In this case the PCI-X bus speed is a limiting factor: it allows somewhat more than 900 MB/s. However, with the advent of PCI Express this limitation may disappear. The latency for short messages has also improved from the earlier ≅ 5 µs.

Since early 2006 Quadrics also offers 10 Gbit Ethernet cards and switches under the name QSTenG.

These products capitalise on the experience obtained in the development of the two generations of

QsNet but they are not meant for inter-processor communication [6].

SCI stands for Scalable Coherent Interface; it became an IEEE standard in 1992. It was born as a reaction to the limitations that were encountered in bus-based multiprocessor systems.

In clusters SCI is always used as just an internode network. Because of the ring structure of SCI, this means that the network is arranged as a 1-D, 2-D, or 3-D torus. One of the SCI vendors provides software to reconfigure the torus in such a way that only a minimal number of nodes becomes unavailable. Furthermore, because of the torus topology it is not possible to add or remove an arbitrary number of nodes in the cluster. Reported bandwidths for SCI-based clusters are up to about 320 MB/s for a ping-pong experiment over MPI, with very low latencies of 1-2 µs for small messages [7].

3. Comparison of interconnection networks

3.1. In scientific publications before 1994

I have found several scientific articles about the problems of comparing interconnection networks.

One of the problems is which metrics of the network should be considered. A sampling of the metrics found in the literature includes: message delay characteristics, permuting ability, partitionability,

fault tolerance, bisection width, graph theoretic measures, throughput, complexity of control, and cost-

effectiveness. In general, it is difficult to select the “important” metrics, because they may vary with the

intended application and operating assumptions. Basing design decisions on different sets of metrics

can result in different design choices [12].

Another difficulty in comparing two different network designs is determining if they are of equal

“cost”, so that the comparison will be “fair.” The “cost” should be defined in terms of hardware

required, rather than actual dollar cost [12].

In any way let’s to define what the tendency is to choice of type’s interconnection network. One of the

important characteristics of a supercomputer is a supercomputer cost. The customers look for a

supercomputer with high-performance and low cost of it. And a cost of interconnection networks plays

not the last role in the full cost of supercomputer. How are characteristics of supercomputers in part of

interconnections network been described in the publications?

One of the publications compared bus and crossbar interconnects [11]. The authors state that traditionally the choice of interconnection architecture has been a bus. Buses are relatively simple and the hardware cost is small. The drawback of a single bus is that only one agent can transmit at a time. One alternative to bus-based architectures is a switched network, described in the paper as a circuit-switched crossbar. A crossbar is a low-latency, non-blocking interconnection network for large data transfers; however, the hardware cost of a crossbar is relatively large. One of the authors' suggestions is to use hierarchical structures in which interconnection segments are connected with bridges (bus) or hierarchical switches (crossbar), as has been implemented in large supercomputers. The crossbar is a relatively complex system architecture, but if the data transmissions are long and if they can be done in parallel, the hardware complexity can be justified [11].

The publication by Anant Agarwal [13] analyses network characteristics based on a model. The model includes the effects of packet size, communication locality and other characteristics.

The network analysis was made under various constraints. After the comparison the author found that message length plays an important role in the tradeoff between low- and high-dimensional networks. Longer messages (such as those in message passing machines) make the relative effect of the network distance from source to destination less important, while the shorter expected message lengths in shared memory machines increase the relative influence of network distance and tend to favor networks with more dimensions.
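This tradeoff can be illustrated with a deliberately simplified latency estimate T = h * t_r + L / b (hop count times per-hop routing delay, plus serialisation time); it is not Agarwal's full model, and the per-hop delay and bandwidth below are assumed values. As the message length L grows, the distance term shrinks as a fraction of the total.

    def transfer_terms_us(hops, message_bytes, per_hop_us=0.05, bandwidth_mb_s=400.0):
        """Return (routing delay, serialisation delay) in microseconds for one message."""
        routing_us = hops * per_hop_us
        serialisation_us = message_bytes / (bandwidth_mb_s * 1e6) * 1e6
        return routing_us, serialisation_us

    for size in (64, 4096, 1_000_000):   # short shared-memory traffic vs long message-passing traffic
        routing_us, serial_us = transfer_terms_us(hops=10, message_bytes=size)
        share = routing_us / (routing_us + serial_us) * 100
        print(f"{size:>8} B message: distance term is {share:5.1f}% of the transfer time")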

Communication locality depends on several factors, including the architecture of the multiprocessor, the compiler and runtime systems, and the characteristics of parallel applications. If the communication

locality of parallel applications can be enhanced through better algorithms and systems architectures,

parallel machines with significantly higher performance can be built without incurring the high cost of

expensive networks [13].

One of the interesting publications is a performance analysis of supercomputers with different interconnection topologies [14], which was topical at the time the article was written. The author made measurements with grid topologies on the SUPRENUM supercomputer; those measurements utilized the original 4-ring interconnection topology.

The author presents the measured results in Table 1. The table indicates the number of processors P, their arrangement as a logical Px × Py rectangular processor array, the computational domain size Mx × My, the resulting computational efficiency (relative to P times the single-node performance) and the Mflops achieved. Each node holds a 32 × 1024 grid and the processors are arranged in a line parallel to the Y direction. The peak performance is seen to be 1272 Mflops.

Table 1. Processor grid aligned with Y axis (double matrix) [14]
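The efficiency column in Tables 1 and 2 is simply the achieved rate divided by P times the single-node rate. A one-line sketch of that definition follows; the single-node rate used in the example is an assumed value, not a figure from [14].

    def parallel_efficiency(achieved_mflops, processors, single_node_mflops):
        """Computational efficiency as defined above: achieved rate / (P * single-node rate)."""
        return achieved_mflops / (processors * single_node_mflops)

    # Hypothetical example: 1272 Mflops on 256 nodes, assuming 8 Mflops per node.
    print(f"{parallel_efficiency(1272, 256, 8.0):.1%}")   # ~62.1%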

He compares the effect of varying the shape of the processor grid for 256-node computations in Table 2. The almost square 4096 × 2048 grid on a 128 × 2 processor array is seen to deliver 1042 Mflops.

Table 2. 256-node performance as function of processor grid (full connect) [14]

From this research I can conclude that a grid topology yields different efficiency and performance depending on the characteristics of the grid.

I also found articles about the possibility of achieving high performance in a supercomputer. One of the articles was presented by Sotirios G. Ziavras [15]. He states that the direct binary hypercube has been one of the most popular interconnection networks because it provides a small diameter and is so robust that it can simulate very efficiently a wide variety of other frequently used interconnection networks. Nevertheless, a major drawback of the hypercube is the dramatic increase in the total number of communication channels with any increase in the total number of processors in the system. This drawback has a significant impact on the very large scale integration (VLSI) complexity of hypercube systems. For this reason, existing commercial hypercube systems have either a large number of primitive processors (for example, the Connection Machine system CM-2 has 65,536 simple processors) or a relatively small number of powerful processors (for example, the nCUBE with 1,024 processors). These systems, however, cannot achieve 1 Teraflop performance, which motivated the development of technologies for constructing massively parallel systems with thousands of processors that could achieve 1 Teraflop peak performance. This performance goal could be reached in that decade only with systems containing thousands of powerful processors [15].

At that time several hypercube variations had been proposed in the literature. Several of them were developed to improve the topological properties of the hypercube. It is well known that the hypercube already has very good topological properties; therefore the viability of such variations is questionable, and in fact they often increase its already high VLSI complexity. Other variations employ hypercube building blocks that contain a special communication processor, with the building blocks connected together through their communication processors. Not only are such structures irregular, but for large building blocks the employment of the communication processors results in substantial bottlenecks.

In contrast, the author has introduced the family of reduced hypercube (RH) interconnection networks, which have smaller VLSI complexity than hypercubes with the same number of processors. RHs are produced from regular hypercubes by a uniform reduction in the number of channels for each node [15].

Extensive comparison of RH interconnection networks with conventional hypercubes of the same size

has proven that they achieve comparable performance. It has been also shown that any RH can simulate

simultaneously in an optimal manner several popular cube-connected cycles interconnection networks.

Additionally, techniques have been developed for the efficient embedding of linear arrays,

multidimensional meshes, and binary trees into RHs. Finally, generalized reduced hypercube

interconnection networks have been introduced for even more versatility.

I have described the opinions about the most popular interconnection networks up to 1994. Comparisons from 1994 to the present are presented in the next subchapter.

3.2. In scientific publications from 1994 to the present

The article [16] compares and evaluates the multicast performance of two System-Area Networks (SANs), Dolphin's Scalable Coherent Interface (SCI) and Myricom's Myrinet. Both networks deliver low latency and high bandwidth to applications, but do not support multicast in hardware. The authors compared SCI and Myrinet in terms of their user-level performance using various software-based multicast algorithms under various networking and multicasting scenarios. The recent Dolphin SCI networks are capable of achieving low latencies (smaller than 2 µs) and high throughputs (5.3 Gbps peak link throughput) over point-to-point links with cut-through switching. The SCI architecture is presented in Picture 2. Using unidirectional ringlets as a basic block, it is possible to obtain a large variety of topologies, such as counter-rotating rings and unidirectional and bidirectional tori [16].

Picture 2. Architectural block diagram of SCI NIC (2D) [16].
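Since neither network supports multicast in hardware, multicast has to be built in software out of point-to-point messages. A common approach is a binomial tree, sketched below with mpi4py; this is one plausible example of a software-based multicast algorithm, not necessarily one of the exact algorithms evaluated in [16].

    from mpi4py import MPI

    def binomial_multicast(comm, obj, root=0):
        """Software multicast built from point-to-point messages along a binomial tree.

        Each rank first receives the object from its parent in the tree and then forwards
        it to its children, so all P ranks are reached in about log2(P) message rounds.
        """
        rank, size = comm.Get_rank(), comm.Get_size()
        rel = (rank - root) % size            # renumber ranks so that the root becomes 0

        # Receive phase: find the bit at which this rank gets the data from its parent.
        mask = 1
        while mask < size:
            if rel & mask:
                obj = comm.recv(source=(rank - mask) % size, tag=99)
                break
            mask <<= 1

        # Send phase: forward to the children at the lower bit positions.
        mask >>= 1
        while mask > 0:
            if rel + mask < size:
                comm.send(obj, dest=(rank + mask) % size, tag=99)
            mask >>= 1
        return obj

    if __name__ == "__main__":
        comm = MPI.COMM_WORLD
        payload = "multicast payload" if comm.Get_rank() == 0 else None
        payload = binomial_multicast(comm, payload, root=0)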

Myrinet provides low latencies (as low as about 6µs) and high data rates (up to 2.0 Gbps). Picture 3

shows the architectural block diagram of a PCI-based Myrinet NIC.


Picture 3. Architectural block diagram of Myrinet NIC [16].

For the Dolphin SCI experiments, Dolphin PCI-64/66/D330 SCI NICs with 5.3 Gb/s link speed are

used along with Scali’s SSP (Scali Software Platform) 3.0.1 communication software. The nodes are

interconnected to form a 4x4 unidirectional 2D torus using 2m cables. The Myrinet experiments are

performed using M2L-PCI64A-2 Myrinet NICs with a 1.28 Gb/s link speed along with Myricom’s

GM-1.6.4 communication software. The nodes are interconnected to form a 16-node star network using

16 3m Myrinet LAN cables and a 16-port Myrinet switch for the Myrinet experiments [16].

The results are shown in Picture 4.


Picture 4. Large-message multicast completion latency.

The conclusion of the article is that the analysis of the most widely deployed interconnects for cluster computing gives different results depending on the scenario. For example, the SCI separate-addressing algorithm is found to be the best choice for small messages. For large messages, Dolphin's SCI has a clear advantage compared to Myricom's Myrinet (i.e. the 5.3 Gb/s of SCI vs. the 1.28 Gb/s of the Myrinet used in this experiment). Although the newest Myrinet hardware features higher data rates (for example 2.0 Gb/s) than the version tested, these rates are still significantly lower than SCI's.

The article [17] also compared SCI and Myrinet. In their study, they concluded that, in terms of their

performance and robustness analyses, Myrinet is a better choice compared to SCI.

One of the articles was a bandwidth and latency analysis on a Cray XT4 system [18]. In the top500 list the system is registered with a point-to-point communication bandwidth of up to 7.6 GB/s and a bisection bandwidth of 667 GB/s per cabinet.

The Cray XT4 interconnection network is arranged in a 3-D mesh topology (some systems employ

wraparound links, producing a torus in some or all dimensions). Each node in the mesh is nominally

connected to its six nearest neighbors by point-to-point communication links. Normally, these links are

configured at boot time by the Cray route manager to operate at full speed. [18].


Picture 5. MPI ping-pong bandwidth and latency for the three supported link bandwidth modes [18].

Three degraded modes are supported: 1/4 bandwidth, 1/2 bandwidth, and 3/4 bandwidth. Picture 5

shows the resulting MPI uni-directional bandwidth and latency for the degraded modes, as well as full

bandwidth. The 3/4 and full link bandwidth results are limited by the compute node’s injection

bandwidth which is lower than the network link bandwidth. Based on the 1/4 bandwidth results, the

effective link bandwidth is approximately 3092 MB/s in each direction. Small-message MPI latency is

virtually unchanged for the degraded modes, indicating that link bandwidth can be controlled

independently of latency [18].

Picture 6 shows MPI ping-pong bandwidth and latency for the four tested HyperTransport link

configurations: 200-MHz 8-bit (400 MB/s per direction), 200-MHz 16-bit (800 MB/s), 400-MHz 16-bit

(1600 MB/s), and 800-MHz 16-bit (3200 MB/s).

Picture 6. MPI ping-pong bandwidth and latency for the four link configurations [18].
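These per-direction figures follow from the link clock, the link width and HyperTransport's double-data-rate signalling (two transfers per clock); a small arithmetic check, assuming the DDR factor of two:

    def ht_link_mb_s(clock_mhz, width_bits, transfers_per_clock=2):
        """Per-direction HyperTransport link bandwidth: clock * width * DDR factor, in MB/s."""
        return clock_mhz * width_bits * transfers_per_clock / 8

    for clock_mhz, width_bits in ((200, 8), (200, 16), (400, 16), (800, 16)):
        print(f"{clock_mhz:3d} MHz, {width_bits:2d}-bit: "
              f"{ht_link_mb_s(clock_mhz, width_bits):6.0f} MB/s per direction")
    # 400, 800, 1600 and 3200 MB/s, matching the four configurations above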

In the author's opinion there were several network interconnects in 2003 that provided low latency (less than 10 µs) and high bandwidth (multiple Gbps). Two of the leading products are Myrinet and Quadrics. In the high performance computing area MPI has been the de facto standard for writing parallel applications. To achieve optimal performance in a cluster, it is very important to implement MPI efficiently on top of the cluster interconnect. Myrinet and Quadrics were designed for high performance computing environments; as a result, their hardware and software are specially optimized to achieve better MPI performance [19].

More recently, InfiniBand has entered the high performance computing market. Unlike Myrinet and

Quadrics, InfiniBand was initially proposed as a generic interconnect for inter-process communication

and I/O. However, its rich feature set, high performance and scalability make it also attractive as a

communication layer for high performance computing [19].

The authors present a comprehensive performance comparison of MPI implementations over InfiniBand,

Myrinet and Quadrics.

The InfiniBand platform consisted of InfiniHost HCAs and an InfiniScale switch from Mellanox. InfiniScale is a full wire-speed switch with eight 10 Gbps ports. The InfiniHost MT23108 HCA connects to the host through a PCI-X bus and allows a bandwidth of up to 10 Gbps over its ports.

The Myrinet network consists of Myrinet M3F-PCIXD-2 network interface cards connected by a Myrinet-

2000 switch. The link bandwidth of the Myrinet network is 2Gbps in each direction. The Myrinet-2000

switch is an 8-port crossbar switch. The network interface card has a 133MHz/64bit PCI-X interface. It

has a programmable LANai-XP processor running at 225 MHz with 2MB onboard SRAM. The LANai

processor on the NIC can access host memory via the PCI-X bus through the DMA controller.

The Quadrics network consists of Elan3 QM-400 network interface cards and an Elite 16 switch. The

Quadrics network has a transmission bandwidth of 400MB/s in each link direction.

Picture 7 shows the MPI-level latency results. The test is conducted in a ping-pong fashion and the

latency is derived from round-trip time. For small messages, Quadrics shows excellent latencies, which

are around 4.6µs. The smallest latencies for InfiniBand and Myrinet are 6.8µs and 6.7µs, respectively.

For large messages, InfiniBand has a clear advantage because of its higher bandwidth [19].

Picture 7. MPI Latency across Three interconnects [19].
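For reference, a ping-pong measurement of this kind can be written in a few lines. The sketch below uses mpi4py and is a generic version, not the exact benchmark used in [19]: it bounces a buffer between ranks 0 and 1, reports half of the round-trip time as the one-way latency, and divides the message size by the one-way time to estimate bandwidth.

    from mpi4py import MPI
    import numpy as np

    def ping_pong(comm, message_bytes, iterations=1000):
        """Return (one-way latency in us, bandwidth in MB/s) measured between ranks 0 and 1."""
        rank = comm.Get_rank()
        buf = np.zeros(message_bytes, dtype=np.uint8)
        comm.Barrier()                       # start all ranks together
        start = MPI.Wtime()
        for _ in range(iterations):
            if rank == 0:
                comm.Send(buf, dest=1, tag=0)
                comm.Recv(buf, source=1, tag=0)
            elif rank == 1:
                comm.Recv(buf, source=0, tag=0)
                comm.Send(buf, dest=0, tag=0)
            # any further ranks simply idle through the loop
        one_way_s = (MPI.Wtime() - start) / (2 * iterations)
        return one_way_s * 1e6, message_bytes / one_way_s / 1e6

    if __name__ == "__main__":
        comm = MPI.COMM_WORLD
        latency_us, bandwidth_mb_s = ping_pong(comm, message_bytes=4)
        if comm.Get_rank() == 0:
            print(f"one-way latency ~ {latency_us:.2f} us, bandwidth ~ {bandwidth_mb_s:.1f} MB/s")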

Picture 8 shows the bi-directional bandwidth results. We can see that InfiniBand bandwidth increases from 841 MB/s uni-directional to 900 MB/s; it is then limited by the bandwidth of the PCI-X bus. Quadrics bandwidth improves from 308 MB/s to 375 MB/s. Myrinet shows even more improvement: its peak bandwidth increases from 235 MB/s to 473 MB/s. However, Myrinet bandwidth drops to less than 340 MB/s when the message size is larger than 256 KB.

Picture 8. MPI BI-Directional bandwidth [19].

The conclusion is that InfiniBand is able to deliver very good performance at the MPI level. Results on the 8-node OSU cluster and the 16-node Topspin cluster also show that InfiniBand has very good scalability. InfiniBand can still outperform the other interconnects if the application is bandwidth-bound.

The next article was presented by Dell [20]. The lab set up a test environment that comprised a cluster of 32 identically configured Dell PowerEdge 1750 servers interconnected with Gigabit Ethernet and with InfiniBand HCAs and switches.

Each PowerEdge 1750 server had two Intel Xeon processors running at 3.06 GHz with 512 KB level 2

(L2) cache, 1 MB level 3 (L3) cache, and 4 GB of DDR RAM operating on a 533 MHz front-side bus

(FSB). The chip set of the PowerEdge 1750 was the ServerWorks GC-LE, which accommodates up to

four registered DDR PC2100 (266 MHz) DIMMs with two-way interleaved memory architecture. Each

PowerEdge 1750 server was also equipped with two integrated Broadcom Gigabit Ethernet adapters,

dual 64-bit 133 MHz PCI-X slots for the interconnects, and the appropriate NICs. The 32 PowerEdge

1750 servers functioned as the compute nodes of the cluster. An Extreme Networks Summit 400-48t

switch with 48 non-blocking Gigabit Ethernet ports was used for Gigabit Ethernet tests. The InfiniBand

components consisted of two Topspin 120 InfiniBand switches and Topspin InfiniBand 10 Gbps PCI-X

HCAs. Each of the Topspin 120 switches was connected to 16 nodes and to each other with eight

interswitch links between the two switches.

Picture 9. Latency comparison between Gigabit Ethernet and InfiniBand [20].


Picture 10. Bandwidth comparison between Gigabit Ethernet and InfiniBand [20].

Picture 9 shows that the latency for the InfiniBand interconnect is fairly low compared to that of the

Gigabit Ethernet interconnect in this study. For a small message size, InfiniBand has a latency of

approximately 7 microseconds (μs) as compared to Gigabit Ethernet, which shows a latency of

approximately 65 μs for the same message size. Picture 10 shows the bandwidth obtained in this study.

Gigabit Ethernet peaks out at roughly 85 MB/sec to 90 MB/sec as compared to InfiniBand, which

shows a peak of approximately 700 MB/sec for large message sizes. The results in this study were

obtained using PCI-X interfaces for both InfiniBand and Gigabit Ethernet on separate PCI-X buses.

Another approach is based on the price of the supercomputer needed to achieve a given level of performance. The results are shown in Table 3.

Table 3. Comparison between Gigabit Ethernet and InfiniBand Clusters from Nov2003 Top500 list of

Computers [21].

In this case, the price/performance equation is simple since InfiniBand switch ports, at less than $300

per port, are far more cost effective than buying servers costing several thousand dollars or more.
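The price/performance argument can be made explicit with a toy dollars-per-Gflop calculation. All cost and performance numbers below are illustrative assumptions, not figures from [21] or from the Top500 list.

    def price_per_gflop(total_cost_usd, rmax_gflops):
        """Simple price/performance metric: dollars per sustained Gflop."""
        return total_cost_usd / rmax_gflops

    # Assumed example clusters: the InfiniBand variant costs a little more (switch ports and HCAs)
    # but delivers noticeably higher sustained performance on communication-heavy workloads.
    print(f"Gigabit Ethernet cluster: {price_per_gflop(400_000, 1_200):6.1f} $/Gflop")
    print(f"InfiniBand cluster:       {price_per_gflop(450_000, 2_000):6.1f} $/Gflop")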


Table 4. Comparison of InfiniBand and GE / 10GE Switches

Table 4 shows that Gigabit Ethernet and 10 Gigabit Ethernet are more expensive interconnection

networks than Infiniband. At the same time, Infiniband provides more performance.

4. TOP500.org statistics

Top500.org lists the interconnect families used in the 500 most powerful supercomputers in the world. For each interconnect family the listing gives the count of systems, the percentage share, the sum of the Rmax values, the sum of the Rpeak values and the total number of processors connected with that interconnect family. Pictures 11 and 12 present graphically the percentage shares of the different interconnection networks listed on top500.org.

Picture 11. Interconnect share for 11/2008 part 1.


Picture 12. Interconnect share for 11/2008 part 2.

The listing reveals that Gigabit Ethernet and Infiniband are clearly the most popular interconnection networks in current supercomputers: over 50% of them use Gigabit Ethernet, and Infiniband's share is almost 30%. The listing also gives the total performance of the supercomputers grouped by interconnect family. [9] For comparison purposes, we calculate in Table 5 each interconnect family's average performance per system and its average performance per processor; a small calculation sketch follows the table.

Table 5. Average performance for supercomputers based on the interconnect family used.

Interconnect family   Rmax Sum/Count   Rpeak Sum/Count   Rmax Sum/Proc. Sum   Rpeak Sum/Proc. Sum   Avg. processors
                      (GF/system)      (GF/system)       (GF/processor)       (GF/processor)        per system
Myrinet               35029            48893             6                    8.64                  5658
Quadrics              30555            36877             5.8                  7                     5260
Gigabit Ethernet      17565            34780             5.25                 10.4                  3344
Infiniband            46434            61833             7.8                  10.4                  5962
Crossbar              35860            40960             7                    8                     5120
Mixed                 66567            82944             4.8                  6                     13824
NUMALink              40851            45875             5.7                  6.4                   7168
SP Switch             22954            27375             6.7                  8                     3420
Proprietary           98644            124853            3.7                  4.7                   26385
Cray Interconnect     59866            78245             4.9                  6.4                   12167
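The derived columns in Table 5 are straightforward ratios of the sums published by top500.org. A minimal sketch of the calculation; the Infiniband sums below are back-calculated from the table row (142 systems, as stated in the text) purely as a consistency check.

    def interconnect_averages(rmax_sum_gf, rpeak_sum_gf, system_count, processor_sum):
        """Average performance per system and per processor for one interconnect family."""
        return {
            "rmax_per_system_gf": rmax_sum_gf / system_count,
            "rpeak_per_system_gf": rpeak_sum_gf / system_count,
            "rmax_per_processor_gf": rmax_sum_gf / processor_sum,
            "rpeak_per_processor_gf": rpeak_sum_gf / processor_sum,
            "avg_processors_per_system": processor_sum / system_count,
        }

    # Back-calculated Infiniband sums (an assumption for illustration, not raw top500 data):
    print(interconnect_averages(rmax_sum_gf=46434 * 142, rpeak_sum_gf=61833 * 142,
                                system_count=142, processor_sum=5962 * 142))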

According to Table 5, the best average performance relative to the number of processors seems to have been achieved when an Infiniband interconnection network is used. The average maximum performance per system indicates that the computers using Infiniband are also quite powerful. Infiniband is used in 142 supercomputers, including some smaller clusters. Supercomputers that use Gigabit Ethernet are less powerful on average, but on the other hand Gigabit Ethernet is currently the most used interconnection network and probably a popular solution for smaller clusters as well. Proprietary interconnection networks are used in 42 supercomputers. They seem to be the most powerful on average, but it is easy to understand that when a special solution of this kind is applied, the computer is expected to be both expensive and powerful.

Table 6. Approximate bandwidths and latencies of interconnection networks [10].

Top500.org has listed approximate bandwidths and latencies of interconnection networks in Table 6. Gigabit Ethernet is available with a maximum theoretical bandwidth of 125 MB per second, and its in-switch latencies are approximately 30-40 µs [10]. This could be a suitable configuration for applications that are not latency-bound at all [10]; otherwise the network choice should be different.
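Whether an application is latency-bound can be judged with the usual first-order estimate T ≈ latency + size / bandwidth. Below it is applied to the Gigabit Ethernet figures quoted above (roughly 35 µs in-switch latency and 125 MB/s); the message sizes are arbitrary examples.

    def message_time_us(message_bytes, latency_us=35.0, bandwidth_mb_s=125.0):
        """Estimated time for one message: startup latency plus serialisation time."""
        return latency_us + message_bytes / (bandwidth_mb_s * 1e6) * 1e6

    for size in (1_000, 100_000, 10_000_000):
        total_us = message_time_us(size)
        print(f"{size:>10} B: {total_us:10.1f} us total, latency is {35.0 / total_us * 100:4.1f}% of it")
    # Latency dominates small transfers, so latency-bound codes need a lower-latency network.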

When using high-bandwidth interconnection networks, the host computer should be able to deliver data as fast as the network. The available bandwidth of the PCI bus is 110-480 MB/s. PCI-X supports a bandwidth of 1 GB/s, and PCI-X v2 also supports bandwidths of 2 and 4 GB/s. The recent PCI-Express has multiple data transfer lanes of 200 MB/s each and supports x2, x4, x8, x16 or x32 lane configurations [10].
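To check whether a given host bus can actually feed a given interconnect, it is enough to compare the two bandwidth figures. The sketch uses the 200 MB/s per-lane figure quoted above and an assumed 10 Gb/s network link as the example.

    import math

    PCIE_LANE_MB_S = 200.0   # per-lane PCI-Express bandwidth quoted in the text above

    def lanes_needed(network_mb_s):
        """Smallest PCI-Express lane count that can keep up with the network link."""
        return math.ceil(network_mb_s / PCIE_LANE_MB_S)

    # Assumed example: a 10 Gb/s interconnect delivers at most 10 * 1000 / 8 = 1250 MB/s.
    link_mb_s = 10 * 1000 / 8
    print(f"A {link_mb_s:.0f} MB/s link needs at least {lanes_needed(link_mb_s)} lanes, i.e. an x8 slot.")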

5. Summary

In conclusion, two interconnection networks dominate supercomputer implementations: Infiniband and Gigabit Ethernet. We can draw this conclusion also from the scientific articles. Gigabit Ethernet takes first place in utilization by virtue of the fact that it is cheaper and simpler to implement.


References

[1] Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar, Introduction to Parallel Computing, Second Edition, Addison-Wesley, 2003, ISBN 0-201-64865-2.

[2] http://www.top500.org/2007_overview_recent_supercomputers/glossary_terms, accessed 20/11/2008.

[3] Sotirios G. Ziavras, "RH: A Versatile Family of Reduced Hypercube Interconnection Networks", 1994.

[4] http://www.top500.org/2007_overview_recent_supercomputers/myrinet, accessed 20/11/2008.

[5] http://www.top500.org/orsc/2006/infiniband, accessed 20/11/2008.

[6] http://www.top500.org/orsc/2006/qsnet, accessed 20/11/2008.

[7] http://www.top500.org/orsc/2006/sci, accessed 20/11/2008.

[9] http://www.top500.org/stats/list/32/connfam, Interconnect Family share for 11/2008, accessed 21/11/2008.

[10] http://www.top500.org/2007_overview_recent_supercomputers/networks, Overview of Recent Supercomputers, accessed 18/11/2008.

[11] V. Lahtinen, E. Salminen, K. Kuusilinna, T. Hamalainen, "Comparison of synthesized bus and crossbar interconnection architectures", Tampere, Finland.

[12] Howard Jay Siegel, "Panel 2: Is It Possible to Fairly Compare Interconnection Networks?", USA.

[13] Anant Agarwal, "Limits on Interconnection Network Performance", USA.

[14] Oliver A. McBryan, "Performance of the SUPRENUM-1 supercomputer with the full bus interconnection network", USA.

[15] Sotirios G. Ziavras, "Reduced hypercube interconnection networks for massively parallel computers".

[16] Sarp Oral and Alan D. George, "A User-level Multicast Performance Comparison of Scalable Coherent Interface and Myrinet Interconnects", USA.

[17] M. Fischer, U. Brüning, J. Kluge, L. Rzymianowicz, P. Schulz, and M. Waack, "ATOLL, a new switched, high speed Interconnect in Comparison to Myrinet and SCI", Mexico, 2000.

[18] Kevin T. Pedretti, Courtenay Vaughan, K. Scott Hemmert, and Brian Barrett, "Application Sensitivity to Link and Injection Bandwidth on a Cray XT4 System", Finland, 2008.

[19] Jiuxing Liu, Balasubramanian Chandrasekaran, Jiesheng Wu, Weihang Jiang, "Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics", 2003.

[20] Dell Power Solutions, "Exploring InfiniBand as an HPC Cluster Interconnect", 2004.

[21] Mellanox Technologies Inc., "InfiniBand Clustering Delivering Better Price/Performance than Ethernet", USA.