HIGH-PERFORMANCE INTERCONNECTION NETWORKS: A KEY COMPONENT IN FUTURE SYSTEMS

José Duato and Federico Silla

Technical University of Valencia, Spain
[email protected], [email protected]

1. INTRODUCTION

Research in academia usually focuses on rather narrow topics. In the case of computer architecture, such isolated research areas include processor microarchitecture, memory hierarchy, cache coherence protocols, and interconnection networks, for example. As a consequence of focusing on a particular research area, even when radically new solutions are proposed (e.g. a cost-effective fully adaptive routing algorithm), those solutions only improve the subset of the system targeted by that particular research and therefore do not eliminate the inefficiencies that are a direct consequence of the system architecture. Those solutions may thus not be globally optimal. For instance, too much of the resource or power budget is sometimes devoted to improving a component that is not the system bottleneck.

In order to avoid this kind of limitation, a global system view is required even when addressing problems in a particular subsystem. When looking at computer systems from a global perspective, researchers start (or should start) by looking at application requirements. However, there is a fundamental flaw in this approach: existing applications were designed for existing computer systems. Additionally, because new computer systems are designed to run existing benchmarks faster, we enter a process that goes round and round in circles where, in the global optimization phase, we lose the opportunity to replace the existing programming model and style. We may even end up proposing techniques to recover parallelism that has been lost due to previous optimizations.

In this paper we show that the ingredients to build high-performance architectures at low cost are ready but, unfortunately, what the technology likely requires in order to mature is a shift in the way high-end systems are envisioned. A global perspective that considers current market trends as well as current technology is needed. We also show that high-performance interconnection networks are crucial when this shift takes place.

2. THE USE OF COMMODITY COMPONENTS

Despite the circular process mentioned in the previous section, the architecture of high-performance computing systems has evolved significantly over time. More specifically, almost 15 years ago a new trend started: using commodity components to build cost-effective high-performance clusters. Indeed, in late 1993, Donald Becker and Thomas Sterling began sketching the outline of a commodity-based cluster system designed as a cost-effective alternative to large supercomputers. In early 1994, working at CESDIS under the sponsorship of the HPCC/ESS project, the Beowulf Project [9] was started. Beowulf clusters are scalable performance clusters based on commodity hardware and a private system network, with an open source software (Linux) infrastructure.

A number of factors have contributed to the growth of Beowulf class computers [10]:

• The prevalence of computers for office automation, home computing, games and entertainment now provides system designers with cost-effective components.

• The COTS (Commodity Off-The-Shelf) industry now provides fully assembled subsystems (microprocessors, motherboards, disks and network interfaces).

• Mass market competition has driven prices down and reliability up.

• The availability of open source software, particularly Linux, GNU compilers and programming tools, and MPI and PVM libraries.

• Programs like the HPCC program have produced many years of experience working with parallel algorithms.

• Obtaining high performance, even from vendor-provided parallel platforms, is hard work and requires a do-it-yourself attitude.

• An increased reliance on computational science, which demands high-performance computing.


In the context of the Beowulf project, equally important to the performance improvements in microprocessors are the cost/performance gains in I/O and network technology. In fact, one decade ago, manufacturers realized that the I/O subsystem was the main bottleneck in large servers, and thus new interconnect standards were developed (e.g. PCI-X, InfiniBand [16], PCI-Express, HyperTransport) while some existing standards, like Fibre Channel [14], ramped up.

Figure 1 shows the evolution of high-performance computers toward commodity components. As can be seen, the fraction of computers in the Top500 list [19] that are clusters based on commodity components has dramatically increased during the last few years.

Fig. 1. Evolution in the composition of the Top500 supercomputer list (image downloaded from Top500's web site [19]).

In addition to using state-of-the-art commodity components for every new design, most of the recent improvements come from more compact form factors that provide smaller volume and weight, reduced cabling, and shared power supplies, cooling, and peripherals (e.g. DVD drive). Figure 2 depicts an example of this, showing a small cabinet composed of ten units (referred to as blades), each of them a four-way system.

Commodity components also have some disadvantages, derived mainly from inherited limitations. In the case of interconnects, for instance, network interfaces were designed for data communication, not for high-performance computing. Additionally, as Figure 3 shows, going through the north bridge, south bridge, PCI-Express, and NIC drastically increases latency. Moreover, user-mode NIC access has serious limitations. The main consequence of these constraints is that current commodity interconnects are not aware of memory accesses by the processor(s). This forces programmers to use parallel programming models based on message passing. Additionally, current commodity interconnects deliver high bandwidth but latency is also high, forcing programmers to accumulate data and send long messages in order to amortize the communication startup latency. Finally, data accumulation leads to coarse-grain communication and reduces the amount of parallelism that can be exploited. Scalability is seriously affected, drastically limiting the number of processors on which a parallel application can be efficiently executed. Current clusters suffer from these penalties. However, this could dramatically change if the right shift in the way high-end systems are designed takes place, as shown in the next sections.
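To quantify the effect of startup latency, the following sketch evaluates a simple linear cost model, transfer time = startup latency + message size / link bandwidth. The 10-microsecond startup and 1 GB/s link are illustrative assumptions, not measurements of any particular interconnect:

    # Toy model of message transfer over a commodity interconnect:
    #   time = startup_latency + message_size / link_bandwidth
    # The numbers below are illustrative assumptions, not measurements.

    def effective_bandwidth(msg_bytes, startup_us=10.0, link_gb_s=1.0):
        """Effective bandwidth (GB/s) delivered to the application for one message."""
        transfer_us = startup_us + msg_bytes / (link_gb_s * 1e3)  # 1 GB/s = 1e3 bytes/us
        return (msg_bytes / transfer_us) / 1e3                    # back to GB/s

    if __name__ == "__main__":
        for size in (64, 1024, 16 * 1024, 256 * 1024, 4 * 1024 * 1024):
            print(f"{size:>8} B -> {effective_bandwidth(size):5.2f} GB/s effective")
        # Small messages see a tiny fraction of the link bandwidth; only multi-megabyte
        # messages approach the 1 GB/s peak, which is why programmers accumulate data.

Under this model, half of the peak bandwidth is reached only when the message size equals the startup latency times the link bandwidth (about 10 KB with the numbers above), which is why fine-grain communication fares so poorly on such interconnects.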

3. CURRENT TRENDS AND SYNERGIES

Before considering how parallel machines could look in the near future, let us analyze the current state of the computer industry. There are three major trends that may dramatically affect how future systems will be designed: the multi-core race, parallel programming, and the use of accelerators.

Three years ago, commodity processors hit the heat dissipation wall, which forced microprocessor manufacturers to move to multi-core chips, replacing uniprocessor chips with multi-core processor chips and achieving much lower power consumption for the same peak computing power. The multi-core race had started.


Fig. 2. Four-way blade and a cabinet composed of ten of these blades (image downloaded from Sun’s web site [18]).

Manufacturers keep increasing the number of cores per chip at a slow rate. However, many users do not know what to do with the additional cores, because multi-core chips can no longer exploit parallelism in an automatic way and, therefore, applications must be multithreaded to take advantage of those cores. Note that up to now it has been quite easy to convince desktop and laptop users that a second core is beneficial even for running single-thread applications; the argument has always been to run the antivirus and firewall on the second core. But when four-core processors are used, users face the problem of what to do with the two extra cores. In order to use the additional cores efficiently, applications need to be rewritten. Simpler programming models will likely become much more widespread than more sophisticated ones. For example, shared memory is easier to program than message passing. Additionally, architectural details must be hidden from application programmers, because exposing them may hinder the acceptance of a given architecture. This could be the case, for instance, of the PlayStation 3 [20], which is more difficult to program than the Xbox 360 [21] despite being more powerful. Additionally, multi-cores will soon face the memory bandwidth wall, since feeding all the cores is a complex task. This will be aggravated when running applications that do not share data (e.g. multiple virtual servers) and/or when the graphics accelerator is included in the same chip.
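Returning to the claim that shared memory is easier to program than message passing, the toy comparison below (Python is used only for brevity; the function names are ours) sums an array both ways: the shared-memory version simply reads and writes a common list, while the message-passing version must partition the data and communicate results explicitly:

    import threading
    import multiprocessing

    def shared_memory_sum(data, n_workers=4):
        """Shared-memory style: threads read and write a common array directly."""
        partial = [0] * n_workers
        def worker(wid):
            partial[wid] = sum(data[wid::n_workers])   # direct access to shared data
        threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return sum(partial)

    def _mp_worker(chunk, queue):
        queue.put(sum(chunk))                          # results must be sent explicitly

    def message_passing_sum(data, n_workers=4):
        """Message-passing style: each worker owns its chunk and communicates via a queue."""
        queue = multiprocessing.Queue()
        chunks = [data[w::n_workers] for w in range(n_workers)]
        procs = [multiprocessing.Process(target=_mp_worker, args=(c, queue)) for c in chunks]
        for p in procs:
            p.start()
        total = sum(queue.get() for _ in procs)
        for p in procs:
            p.join()
        return total

    if __name__ == "__main__":
        data = list(range(1_000_000))
        assert shared_memory_sum(data) == message_passing_sum(data) == sum(data)

The difference is small in a toy example, but it grows quickly once data structures must be partitioned, packed, and kept consistent across separate address spaces.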

With respect to accelerators, these devices can execute repetitive compute-intensive functions much faster than host processors. Different flavors have been designed: GPU-based accelerators [17], FPGA-based accelerators [13], SIMD-based accelerators [11], etc. Although accelerators are not good for code fragments with high memory bandwidth requirements, unless the accelerator implements a large and fast local memory (e.g. graphics cards), they are becoming popular due to the availability of compilers and programming tools.

The three trends mentioned above are evolving in such a way that some synergies are appearing. First, mass-market applications that require more computing power (e.g. video games) are forcing application developers toward parallel programming. As a result, the number of programmers able to develop multithreaded applications will very likely increase at a fast pace during the next few years, which will make parallel programming more popular.


Fig. 3. Traditional chipset architecture in current PCs.

However, because most of these application developers will become familiar with shared-memory models but not so much with message passing, shared-memory machines will likely be preferred over current clusters. Therefore, this should be taken into account when designing future systems.

4. FEASIBLE FUTURE SYSTEM ARCHITECTURES

In this scenario of commodity components, where parallel programming will likely become more popular through shared-memory machines, chip architecture and system architecture have become much more relevant than core microarchitecture, because processor cores are now commodity components. Given a processor core, system designers should answer questions such as: how many cores to include in a single chip and how to interconnect them (i.e. on-chip networks); whether to use homogeneous or heterogeneous cores; what the cache hierarchy should be (how many levels, private vs. shared); what the memory organization should be (local vs. shared; hardware coherence vs. software coherence vs. coherence domains vs. non-coherent); what the pin bandwidth constraints are; where to attach the network interfaces; and what kind of storage to use (traditional hard disks vs. solid-state disks vs. non-volatile memory, i.e. flash).

One of the possibilities a system architect may consider is designing large-scale cc-NUMA architectures such as the one depicted in Figure 4. These architectures are based on the idea of using physically distributed, logically shared memory. Such systems provide a quite simple programming model. Caches are mandatory to deliver good performance, which increases when locality is properly exploited, and hardware takes care of coherence. However, keeping caches coherent in large systems is a nightmare: broadcast-based protocols do not scale, and directory-based protocols do not scale well either, since directory memory may become larger than main memory for systems with 1024 nodes or more. Moreover, directories introduce additional latency when accessing memory, and fault containment becomes extremely difficult.

Many techniques have been proposed to make directories scalable [1], and a few of them have been implemented. Some of these techniques are:

• Coarse-grain bit vectors: Reduction by a constant factor, still linear growth.

• Limited directories: Logarithmic growth, quite effective, but terrible performance if broadcasts are frequent.

• Linked lists (linear and trees): Logarithmic growth, very compact, but too slow when the system becomes large.

• Logical tree level with implicit reference node: The most compact scheme (log log N), but it generates excessive traffic in the network.

• Logical tree level with explicit reference node: Logarithmic growth (log N + log log N), a very compact and quite effective solution, but it still generates a lot of traffic in the network.

Fig. 4. cc-NUMA architecture.

Multiple combinations of the techniques above are possible, but either the directories require too much memory, access becomes too slow, or the protocols generate too much traffic in the network.
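To see why directory size is the limiting factor, the following back-of-the-envelope sketch computes directory storage as a fraction of main memory for three of the schemes listed above; the 64-byte block size, the number of nodes per bit, and the number of pointers are illustrative assumptions only:

    # Directory storage as a fraction of main memory for a few of the schemes above.
    # Assumptions (illustrative only): 64-byte cache lines, one directory entry per
    # memory block, and entry sizes that ignore the few extra state bits.

    import math

    BLOCK_BITS = 64 * 8   # bits of data tracked by one directory entry

    def full_map(n_nodes):
        return n_nodes                          # one presence bit per node

    def coarse_vector(n_nodes, nodes_per_bit=4):
        return math.ceil(n_nodes / nodes_per_bit)

    def limited_pointers(n_nodes, n_pointers=4):
        return n_pointers * math.ceil(math.log2(n_nodes))

    if __name__ == "__main__":
        for n in (64, 256, 1024, 4096):
            for name, entry_bits in (("full map", full_map(n)),
                                     ("coarse vector", coarse_vector(n)),
                                     ("limited pointers", limited_pointers(n))):
                overhead = entry_bits / BLOCK_BITS
                print(f"{n:>5} nodes, {name:<16}: {overhead:6.1%} of main memory")
        # With 64-byte blocks, a full-map directory already needs 200% of main memory
        # at 1024 nodes, while limited pointers stay in the low single-digit percent range.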

There are some additional techniques to reduce latency [2, 3]. For instance, some techniques predict cache line sharers, converting a 3-hop protocol into a 2-hop protocol and removing the directory access from the critical path. Another technique uses multi-level directory caches, including an on-chip level. Additionally, low-latency interconnects are particularly useful when combined with on-chip directory caches. However, none of these techniques has been evaluated for very large systems.

A possibly better design choice for shared-memory systems is using non-coherent globally shared memory. A lot of research effort has been devoted to distributed shared-memory systems with non-coherent caches, aiming at increasing system scalability. In the past, a huge body of research on Virtual Shared Memory (VSM) architectures, aiming at simplifying programming models on networks of workstations, was carried out; however, implementations are usually quite inefficient. Recently, a new trend has appeared, consisting of multiple coherence domains and motivated by the widespread adoption of virtual servers.

Transactional memory [6, 12] may simplify the use of non-coherent globally shared memory. The most difficult task when developing multithreaded applications is making sure that the program works (e.g. deadlocks may occur when combining individually correct code fragments). Transactional memory is a concurrency control mechanism for controlling access to shared memory. A transaction is a piece of code that executes a series of reads and writes to shared memory which logically occur at a single instant in time, and transactions are typically implemented in a lock-free way. Transactional memory is optimistic: every thread completes its modifications to shared memory without regard for what other threads might be doing, recording every read and write that it makes in a log, which is validated in the commit stage. Implementing part of the system memory as transactional memory could be the solution for storing shared data in parallel applications while simplifying programming.
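To make the read/write log mechanism concrete, here is a minimal sketch of an optimistic, software-only transaction scheme. It is our own toy construction (the TVar, Transaction, and atomic names are ours), not a description of any production transactional memory, and it ignores subtleties such as a doomed transaction temporarily observing an inconsistent snapshot:

    import threading

    class TVar:
        """A transactional variable: a value plus a version number bumped on commit."""
        def __init__(self, value):
            self.value = value
            self.version = 0

    _commit_lock = threading.Lock()   # only commits are serialized; execution is optimistic

    class Transaction:
        def __init__(self):
            self.reads = {}    # TVar -> version observed when first read
            self.writes = {}   # TVar -> tentative new value (the write log)

        def read(self, tvar):
            if tvar in self.writes:               # read-your-own-writes
                return self.writes[tvar]
            self.reads.setdefault(tvar, tvar.version)
            return tvar.value

        def write(self, tvar, value):
            self.writes[tvar] = value             # buffered, not visible until commit

        def commit(self):
            with _commit_lock:
                # Validation: every variable we read must still have the version we saw.
                if any(tvar.version != seen for tvar, seen in self.reads.items()):
                    return False                  # conflict detected, caller retries
                for tvar, value in self.writes.items():
                    tvar.value = value
                    tvar.version += 1
                return True

    def atomic(body):
        """Run body(tx) optimistically, retrying until it commits without conflicts."""
        while True:
            tx = Transaction()
            result = body(tx)
            if tx.commit():
                return result

    # Example: a transfer between two shared counters that commits atomically.
    a, b = TVar(100), TVar(0)

    def transfer(tx, amount=10):
        tx.write(a, tx.read(a) - amount)
        tx.write(b, tx.read(b) + amount)

    atomic(transfer)
    print(a.value, b.value)   # 90 10

Note that the code inside transfer never takes a lock itself; conflicts are detected at commit time by comparing versions, which is the optimistic behavior described above.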

Another ingredient that system designers should take into account for future system architectures is accelerators. Computing power can be enhanced by using accelerators, which are a hot topic nowadays, as mentioned in the previous section. In a large system, virtual interfaces to the accelerator(s) are required to allow multiple processors to share a single accelerator in an effective way, leading to heterogeneous systems in which processors and accelerators are seamlessly interconnected, allowing direct communication among accelerators. This direct communication is necessary in order to minimize memory bandwidth requirements. In this way, data processed by one accelerator can be sent directly to another one in order to continue its processing, as in a systolic array [7]. In fact, by enabling direct communication among accelerators, modular systolic arrays could be used to build larger systems. On-chip systolic arrays [8], such as the one shown in Figure 5, may be conceived to minimize memory bandwidth requirements. Systolic arrays could be the answer to HPC under limited pin bandwidth, as current VLSI technology allows the implementation of an entire array within one chip, and each chip can in turn be a module of a larger array, as mentioned above.
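To make this dataflow concrete, the following sketch is a cycle-by-cycle software emulation of a generic output-stationary systolic array for matrix multiplication; it is our own illustration under simple assumptions (square matrices, one multiply-accumulate per processing element per cycle) and not the specific design of [8]. Each processing element only exchanges operands with its immediate neighbors and accumulates its result locally:

    def systolic_matmul(A, B):
        """Cycle-by-cycle emulation of an n x n output-stationary systolic array:
        A values flow left-to-right, B values top-to-bottom, and PE(i, j) keeps C[i][j]."""
        n = len(A)
        C = [[0] * n for _ in range(n)]
        a = [[0] * n for _ in range(n)]     # value latched in each PE's horizontal register
        b = [[0] * n for _ in range(n)]     # value latched in each PE's vertical register
        for t in range(3 * n - 2):          # cycles needed for all operands to traverse the array
            new_a = [[0] * n for _ in range(n)]
            new_b = [[0] * n for _ in range(n)]
            for i in range(n):
                for j in range(n):
                    # Operand arriving this cycle: from the left neighbor (or the skewed
                    # input stream at the array edge) and from the neighbor above.
                    a_in = a[i][j - 1] if j > 0 else (A[i][t - i] if 0 <= t - i < n else 0)
                    b_in = b[i - 1][j] if i > 0 else (B[t - j][j] if 0 <= t - j < n else 0)
                    C[i][j] += a_in * b_in  # local accumulation, no memory traffic
                    new_a[i][j] = a_in      # forwarded to the right neighbor next cycle
                    new_b[i][j] = b_in      # forwarded to the neighbor below next cycle
            a, b = new_a, new_b
        return C

    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]

Only the processing elements on the array edges consume input/output bandwidth, which is why systolic arrays fit the limited pin bandwidth scenario mentioned above.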

In summary, cc-NUMA architectures are expensive and not very scalable. In contrast, non-coherent shared-memory architectures, as well as shared-memory architectures with multiple coherence domains, are feasible. Therefore, a feasible, scalable, flexible and cost-effective approach for future systems is using a global address space, not necessarily coherent, where each page has configurable semantics (private, coherent, non-coherent, transactional). Additionally, accelerators can play a vital role in increasing computing power and reducing power consumption.


Fig. 5. On-chip systolic array.

5. A KEY SUBSYSTEM: THE INTERCONNECT

Since the system model described above, a shared-memory multiprocessor, provides fine-grain parallelism, it requires fine-grain communication. We can therefore no longer send long messages to amortize the communication startup latency; instead, the communication subsystem should be designed in such a way that the startup latency disappears. Moreover, as the network is now in the critical path when accessing remote memory, a fast interconnection network will enable the exploitation of higher degrees of concurrency, thus increasing scalability. Such a fast interconnection network requires several features [5]:

• Low latency: The primary goal in most multiprocessors, since application execution time directly depends on it. Low latency can be achieved through on-chip network interfaces and pipelined switching techniques.

• High throughput: If the network cannot handle all the traffic generated by the applications, congestion arises and latency increases by orders of magnitude.

• Low cost: Commodity components vs. ad-hoc designs.

• Low power consumption: Links consume more than 50% of the network power.

• High reliability: Most networks provide alternative paths that can be exploited.

• Provide service guarantees: Mostly for sharing a physical machine among virtual ones.

Regarding latency, software overhead is the most important contributor to latency in current message-passing systems. It can be eliminated by making hardware (e.g. the memory controller) automatically generate messages. Moreover, going from memory through the north bridge, the south bridge (e.g. the PCI-Express bridge), and the network interface memory drastically increases latency, which could therefore be reduced by integrating network interfaces into the processor chip (e.g. via HyperTransport [15]). In addition to the delay introduced by these components, packets are not usually pipelined through the network interface and the switch fabric. Finally, protocol conversions between standards may add to latency.

Regarding throughput, network links, arranged according to some topology, provide the raw network bandwidth, but there are additional well-known techniques to improve throughput:

• Link-level and network-level protocols that deliver the required functionality to transmit packets while trying to minimize extra bandwidth consumption.

• Switching techniques that try to minimize latency through switches while making efficient use of link bandwidth and buffer space.

• Adaptive routing algorithms and load balancing techniques that try to balance link utilization in order to delay saturation and maximize throughput.

• Congestion management mechanisms that prevent performance degradation when entering saturation, thus making it safe to operate at high loads.


6. CONCLUSIONS

The use of commodity components has been the key to delivering tremendous computing power at an affordable cost (e.g. clusters). However, parallel architectures based on current commodity components have intrinsic limitations that prevent the efficient exploitation of parallelism.

Current multi-core trends, combined with highly demanding mass-market applications (e.g. computer games), will force the rapid expansion of shared-memory parallel programming. The computer industry should use this unique opportunity to design scalable, cost-effective shared-memory architectures. Low-latency, high-bandwidth interconnects are the key subsystem enabling the design of scalable shared-memory architectures. Several efficient solutions exist for different subsystems, including interconnects [4, 5]. What remains to be done is finding the right combination of components that will enable those high-performance architectures at low cost.

7. REFERENCES

[1] M. E. Acacio, J. Gonzalez, J. M. Garcia, and J. Duato, A Two-Level Directory Architecture for Highly Scalable cc-NUMA Multiprocessors, IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 1, pp. 67-79, January 2005.

[2] M. E. Acacio, J. Gonzalez, J. M. Garcia, and J. Duato, A Novel Approach to Reduce L2 Miss Latency in Shared-Memory Multiprocessors, 16th Int'l Parallel and Distributed Processing Symposium (IPDPS'02), April 2002.

[3] M. E. Acacio, J. Gonzalez, J. M. Garcia, and J. Duato, The Use of Prediction for Accelerating Upgrade Misses in cc-NUMA Multiprocessors, 11th Int'l Conference on Parallel Architectures and Compilation Techniques (PACT 2002), September 2002.

[4] W. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann Publishers Inc., 2003.

[5] J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann Publishers Inc., 2002.

[6] T. F. Knight, An Architecture for Mostly Functional Languages, Proceedings of the ACM Lisp and Functional Programming Conference, pp. 500-519, August 1986.

[7] H. T. Kung and C. E. Leiserson, Systolic Arrays (for VLSI), Sparse Matrix Proceedings, 1978. Also in Society for Industrial and Applied Mathematics, pp. 256-282, 1979.

[8] W. Music, A Systolic Array Implementation Using FPGAs, COTS Journal, http://www.cotsjournalonline.com/home/article.php?id=100249

[9] http://www.beowulf.org

[10] http://www.beowulf.org/overview/history.html

[11] http://www.clearspeed.com

[12] http://www.cs.wisc.edu/trans-memory/biblio/index.html

[13] http://www.drccomputer.com

[14] http://www.fibrechannel.org

[15] http://www.hypertransport.org

[16] http://www.infinibandta.org

[17] http://www.nvidia.com/object/tesla_computing_solutions.html

[18] http://www.sun.com

[19] http://www.top500.org

[20] http://www.us.playstation.com/PS3/About/TechnicalSpecifications

[21] http://www.xbox.com
