
The Supercomputer Supernet (SSN): A High-Speed Electro-Optic Campus and Metropolitan Network*

Nicholas Bambos, Joseph Bannister, Larry Bergman, Jason Gong, Eli Gafni, Mario Gerla, Leonard Kleinrock, Steve Monacos,

Po-Chi Hu, B. Kannan, Bruce Kwan, Prasasth Palnati, John Peck and Simon Walton

Computer Science Department, University of California, Los Angeles, CA 90095-1596.

Abstract

The Supercomputer Supernet (SSN) is a high-performance, scalable optical interconnection network for supercomputers and workstation clusters based on asynchronous, wormhole-routing switches. The WDM optical backbone extends the geographic coverage range from interdepartmental to campus and even to metropolitan areas with dynamically reconfigurable direct or multi-hop connections. The network provides very high-speed integrated services, supporting connection-oriented, guaranteed bandwidth traffic as well as datagram traffic. The first networking level of the two-level SSN architecture is electronic and consists of crossbar meshes locally interconnecting workstations, supercomputers, peripheral devices and mass memory. At a higher networking level, an optical backbone network supporting multiple wavelength division multiplexed (WDM) channels allows communication between devices connected to distinct crossbar meshes. In this paper, we focus on the protocols of the WDM optical backbone network and address the issue of integration of the electronic wormhole-routing LAN with the optical backbone.

Keywords: Supercomputer interconnection network, optical WDM network, wormhole routing.

1 Introduction

The Supercomputer Supernet (SSN) currently being developed at UCLA, JPL and Aerospace under ARPA support is a novel, high-performance, scalable optical interconnection network for supercomputers and workstation clusters based on asynchronous wormhole-routing crossbar switches. SSN has a two-level architecture in which high-speed electronic LANs are interconnected via a wavelength division multiplexed (WDM) optical network. The high-speed electronic LAN uses crossbars to interconnect workstations, supercomputers, peripheral devices and mass memory. The high-speed LAN employs wormhole routing and backpressure flow control to provide a low-latency, high-bandwidth connection over limited distance (maximum link length is 25 m). The WDM fiber-optic backbone network extends the geographic range from interdepartmental to campus and even to metropolitan areas via dynamically reconfigurable single-hop or multi-hop connections.

*This work was supported by the USDOD ARPA/CSTO under Contract DABT63-93-C-0055, "The Distributed Supercomputer Supernet — A Multi-Service Optical Intelligent Network."

Distributed supercomputing and cluster computing impose critical demands on the network. These include low-latency, high-bandwidth connections, support for guaranteed bandwidth service and mechanisms to deal with unbalanced traffic patterns. The high-speed electronic LAN employs wormhole routing to provide low-latency datagram traffic support. The WDM fabric, on the other hand, can efficiently support guaranteed bandwidth services because of the enormous bandwidth available. Thus, the challenge is to effectively integrate the high intelligence and flexibility of electronic switching with the high throughput and parallelism of optics. The main thrust of this paper is to show how the two-level SSN architecture actually achieves this integration and provides support for distributed supercomputing and cluster computing.

22 / SPIE Vol. 2692 0-8194-2066-2/96/$6.00

In Section 2, the supercomputer interconnection environment is reviewed with emphasis on supercomputer applications as well as interconnection options both in the optical and electronic domain. In Section 3, the SSN architecture is described, beginning with the Myrinet high-speed, wormhole-routing LAN. SSN builds on this basic LAN switching fabric by adding the WDM fiber links and an Optical Channel Interface (OCI) controller, which extends the range from building to campus. The overall SSN network protocol suite for datagram and stream services (i.e., datagram transfer, flow control, routing, multicasting, etc.) is described. In Section 4, representative performance results obtained via simulation are reported. The SSN testbed for the LAN and MAN setting is described in Section 5. Section 6 concludes the paper.

2 The Supercomputer Interconnection Environment

In this section, we present some representative applications enabled by a supercomputer or workstation network. The requirements of such applications are then identified, and candidate networking approaches in the electronic and optical domain are reviewed.

2.1 New Enabling Applications

Just as conventional LAN technology enabled such present-day applications as the Network File System (NFS), the emerging low-latency, high-bandwidth (< 1 μs, > 60 Mbytes/s [MB/s]) workstation cluster networks will likely enable new applications. These include:

• low-cost arrayable video servers

• distributed-memory parallel supercomputers (employing fine-grain process management on 100's-1000's of processor elements, distributed checkpointing of jobs, and dynamic entry of new hosts)

• network-based memory management

• real-time data acquisition and processing systems (e.g., radar, missile tracking, remote robot control)

• video display walls (with the capability of perusing huge database archives very quickly and interactively)

One common characteristic of these applications is that they "extend" the RAM memory of any one workstation to any other in the cluster. Such network-based distributed memory permits large applications to run beyond the confines of any one machine's memory space on a demand basis. This function has long been viewed as a minimum requirement to implement efficient message-passing communications on MPP supercomputers, but only recently has it been made possible over network-interconnected workstations using a new breed of low-latency (< 1 μs), high-bandwidth (> 60 MB/s) networks that match the memory bandwidth and responsiveness of workstations.

Adding WDM fiber optics enhances this system in two ways. First, these cluster networks are extended from a machine room (100's of meters) to a campus setting (LAN); second, the multiple WDM channels make it possible to support guaranteed bandwidth services (e.g., voice, video) concurrently with conventional datagram services, with maximum isolation.

2.2 Network Requirements

The above-mentioned applications require that the supercomputer network provide support for some key services. These services include:

1. low-latency datagram service, to support fine-grain distributed supercomputing. Variable-size (as opposed to fixed-size) datagram handling is required in order to avoid segmentation/reassembly delay and overhead in origin and destination hosts.

2. high-bandwidth, connection-oriented service to support scientific visualization, large file transfers and, more generally, time-critical stream transmissions.

3. scalable I/O: that is, I/O which is dynamically scalable in degree of parallelism from the host interface, network fabric, tertiary storage, and the application itself. This is particularly important for large data-flow applications, such as radar or image processing, where the memory contents of MPP machines must be exchanged or updated on short time intervals commensurate with the real-time data source frame rate.

In the SSN project, protocols have been developed both in the high-speed LAN and in the optical backbone, in order to support the above basic services. SSN is designed to provide a very high-speed interconnection (up to one gigabit per second (Gb/s) per channel) that integrates guaranteed bandwidth service and datagram service. In high-end supercomputer networks, SSN will also provide the underlying low-latency network fabric for interconnecting meta-supercomputer machines with scalable I/O. The CASA gigabit testbed demonstrated the power of this meta-computing method using a single-channel HIPPI network, but lacked the capability of setting up multiple channels between MPP machines on a dynamic basis over long distances. Hence, finer-grain parallel applications with massive I/O requirements could not be attempted. Likewise, conventional telecom services, such as ATM, cannot efficiently provide this high bandwidth on demand without long setup delay.

2.3 Optical Interconnection Options

Supercomputer high-speed optical interconnection is an active area of research. Several optical networks have been proposed [3], and a few have been or are being implemented. Optical WDM network testbeds include LAMBDANET [9], Rainbow [7], the All Optical Network (AON) [1], and Lightning [8]. Two main alternatives have emerged in the design of WDM networks: single-hop and multihop networks. Here we briefly outline them, since they are important to SSN's approach to providing services.

Single-hop networks [14] provide a dedicated, switchless path between each communicating pair of nodes. Each two-party communication requires one party to be aware of the other's request to communicate and to find a free virtual channel over which to communicate. This requires frequency-agile lasers and detectors over a broad range of the optical spectrum. The devices must also be capable of nanosecond reaction times. Furthermore, single-hop networks require substantial control and coordination overhead (e.g., rendezvous control and dedicated out-of-band control channels). With current technology, single-hop networks are adequate for circuit-switched traffic, but cannot readily accommodate bursty, short-lived communications.

Instead of using a direct path from source to destination, multihop networks [15] may require some packets to travel across two or more hops, with buffering at intermediate nodes. Because of random queue fluctuations, multihop networks are not suitable for high-throughput, real-time, delay-sensitive traffic. In fact, high data rates require a very streamlined flow control, which in turn gives limited protection against congestion and buffer overflow. Alleviating congestion by dropping or deflecting messages is not a desirable solution. Dropping messages is problematic in supercomputing since, at high data rates, losing the contents of even a single buffer (which can exceed 64 kilobytes) is potentially disastrous. Deflection routing, on the other hand, introduces unpredictable delay and out-of-order delivery, which is intolerable given the high data rates used. In summary, multihop networks are adequate for datagram traffic, but are not well suited for high-speed real-time streams.

Like earlier testbeds, SSN employs WDM technology and high-speed transmission. However, the key feature that distinguishes the proposed SSN testbed from other approaches is the combination of multiple single-hop on-demand circuits with a multihop virtual embedded network to realize a hybrid architecture. This way, both stream traffic and low-latency datagram traffic can be efficiently supported using the appropriate transport mechanism. The SSN also allows for dynamic reconfiguration of the multihop topology by slow retuning of its transceivers.

2.4 Electronic Interconnection Options

Several electronic supercomputer interconnection networks have been researched in the recent past. The Los Alamos MCN (Multiple Crossbar Network) consists of a collection of crossbar switches connected by HIPPI channels. Its high-throughput and low-latency goals are achieved by using a fast packet-switching physical layer and a connectionless protocol to set up a path for each packet (or group of packets). Scalability and high HIPPI connection setup latency are drawbacks of this scheme. Nectar is a network of crossbar switches that supports both guaranteed bandwidth and datagram service (for small packets, < 1 Kbyte). The short packet size and uncontrolled sharing of bandwidth between the guaranteed bandwidth service and the datagram service are potential problems. ATOMIC [4] is a network of crossbar switches interconnected by point-to-point links supporting wormhole routing. Autonet [19] is also a crossbar-based network that supports datagram traffic with low latency and high throughput by employing cut-through routing. Neither ATOMIC nor Autonet provides support for guaranteed bandwidth service.

2.5 The Integration Challenge

From the review of the electronic and optical supercomputer interconnection options, we can conclude that each type of network is best suited to certain functions. In SSN, for example, the high-speed electronic network provides a low-latency, high-bandwidth datagram service employing wormhole routing and backpressure flow control. Its support for guaranteed bandwidth service, multicasting and scalable I/O is limited; its geographic coverage is limited to a few hundred meters; and the congestion caused by traffic imbalance cannot be effectively prevented. The SSN optical backbone network supports both guaranteed bandwidth service (via single-hop) and datagram service (via multi-hop), but with different degrees of implementation complexity. Reconfigurability of the optical network helps to overcome unbalanced traffic patterns, although a robust, distributed reconfiguration algorithm must be developed.

Thus, the integration of the electronic LAN and the optical backbone to achieve the design goals outlined earlier is a challenging task. The challenge is two-fold: to design the appropriate hardware interface (the Optical Channel Interface (OCI)) and to develop protocols that enable the key requirements of distributed supercomputer or workstation cluster computing. For example, the low-latency service of the electronic LAN is extended to the optical backbone by implementing wormhole routing in the backbone as well. Likewise, the backpressure flow control mechanism is extended to the backbone to prevent loss of data due to buffer overflows. We describe the design of the OCI and the protocols in the following section.

3 SSN Architecture

The Supercomputer Supernet (SSN) has a two-level architecture. The lower level consists of a high-speed wormhole-routing electronic LAN, called Myrinet. The higher level of the SSN architecture consists of an optical WDM backbone network that interconnects several lower-level LANs. In this section, we present, in some detail, the architecture of SSN.

3.1 Myrinet

Myrinet [21], manufactured by Myricom, Inc., is a high-speed, switch-based LAN intended to provide access over a limited geographical area. Myrinet has its roots in the multicomputer world [22, 5], where it was used as the interconnection network for the Cosmic Cube, a prototype parallel computer. It uses eight-bit-wide data paths between LAN elements, operating at a data rate of 640 Mb/s. The data channel is a full-duplex point-to-point link from a host interface to a switching node or between switching nodes. The Myrinet LAN transmits nine-bit symbols, eight of which carry data and one of which carries control information. Thus, in addition to data octets, several other nondata symbols are possible. The topology is arbitrary, being any configuration of interconnected host interfaces and switching nodes. Each switching node can have up to 16 ports.

Messages are transmitted in Myrinet as variable-sized worms. Worms are composed of flow-control digits, or flits (in Myrinet, a flit is a byte), and can be up to 5.6 MB long. Each worm has a head portion, a data portion and a tail portion. The head of the worm contains a source route (i.e., a list of switch port numbers). This information is used by the simple, non-blocking Myrinet switches to make switching decisions. The complete route from the source node to a destination node is supplied to the requesting source node by a special route-manager software entity. Since Myrinet uses a form of cut-through routing called wormhole routing [16], the head of the message may arrive at its destination node before the tail has even left the source node. This keeps latency very low. If an in-transit message is blocked at a switch, then the progress of the entire message is halted by backpressure. To achieve low latency, the switches switch a message (i.e., process and route the header) in less than 600 nanoseconds.
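The source-routing step described above can be sketched in a few lines of Python. This is an illustrative model only, not Myricom's implementation: each switch pops the leading flit of the worm (which names that switch's exit port) and forwards the rest unchanged; the function name and flit encoding are assumptions.

```python
def forward_worm(worm):
    """Model of a source-routed wormhole switch hop.

    worm: list of flits; the leading flits are the source route
    (one outgoing port number per switch on the path)."""
    out_port = worm[0]      # leading flit names this switch's exit port
    remaining = worm[1:]    # rest of the worm (route tail + data + tail)
    return out_port, remaining

# A two-switch path: route [3, 1], then payload bytes.
worm = [3, 1, 0x10, 0x20, 0x30]
port, worm = forward_worm(worm)   # first switch exits on port 3
assert port == 3
port, worm = forward_worm(worm)   # second switch exits on port 1
assert (port, worm) == (1, [0x10, 0x20, 0x30])
```

Because each switch only inspects and strips one flit, the routing decision is constant-time, which is consistent with the sub-600 ns switching latency quoted above.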

The full-duplex channels use symbol-by-symbol stop-and-go flow control. Special STOP, GO, and IDLE symbols are available for controlling the flow of messages. Every Myrinet host and switch interface has a so-called slack buffer, which holds a small number of in-transit symbols. The slack buffer is large enough to hold twice as many symbols as can propagate simultaneously on a maximum-length link (27 symbols). When the receiving slack buffer has filled beyond a threshold, the receiver sends a STOP symbol to halt the incoming flow. Since it could take up to a full link-propagation delay for the STOP to arrive, the buffer must be able to absorb at least two links' worth of symbols beyond the threshold. When the sender receives the STOP, it immediately throttles its flow. When the switch's port unblocks and the slack buffer has drained below another threshold, the receiver sends a GO symbol to restart the flow.
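The slack-buffer mechanism above can be modeled as follows. This is a behavioral sketch under assumed parameters (capacity 54 = 2 x 27 symbols, with illustrative STOP/GO thresholds), not Myricom's hardware design: the buffer must absorb the symbols still in flight after STOP is sent, since STOP takes a link-propagation delay to reach the sender.

```python
class SlackBuffer:
    """Toy model of Myrinet-style stop-and-go flow control."""

    def __init__(self, capacity=54, stop_at=27, go_at=13):
        self.buf = []
        self.capacity = capacity   # 2x the 27 in-flight symbols of a max link
        self.stop_at = stop_at     # send STOP when occupancy reaches this
        self.go_at = go_at         # send GO when occupancy drains below this
        self.stopped = False

    def receive(self, symbol):
        assert len(self.buf) < self.capacity, "overflow: slack too small"
        self.buf.append(symbol)
        if not self.stopped and len(self.buf) >= self.stop_at:
            self.stopped = True
            return "STOP"          # control symbol sent back to the sender
        return None

    def drain(self):
        sym = self.buf.pop(0) if self.buf else None
        if self.stopped and len(self.buf) < self.go_at:
            self.stopped = False
            return sym, "GO"       # restart the halted sender
        return sym, None
```

Filling the buffer to the STOP threshold and then delivering a full link's worth of in-flight symbols never overflows, which is exactly the sizing argument made in the text.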

The commercial Myrinet already comes equipped with protocols for the support of datagram service. Source routing and backpressure flow control allow efficient transfer of datagrams in the Myrinet. Myrinet does not, however, provide support for guaranteed bandwidth services. Further, multicast is not implemented in Myrinet: a source node would have to transmit a separate copy of a multicast message to each destination node. To overcome these deficiencies, additional protocols are being implemented in the SSN program to provide:

(a) integrated datagram and guaranteed bandwidthservice support

(b) bandwidth allocation to guaranteed bandwidthtraffic

(c) "intelligent" alternate routing to minimize blocking and reduce latency (more generally, "congestion management")

(d) multicasting

(e) priority support and QoS enforcement for differ-ent traffic classes.

Some of these protocols have been extensively evaluated via analysis and simulation. Selected performance results are reported in Section 4.

3.2 Optical WDM Fabric

The second level of the Supercomputer Supernet (SSN) is an optical backbone network that interconnects several high-speed Myrinet LANs. The Optical Channel Interface (OCI) acts as an interface between the optical backbone and the high-speed Myrinet LANs. The physical optical backbone network can be any architecture — a single passive star coupler, a tree or a star of stars [10]. Space division multiplexing (using multiple fibers) is employed too.

The optical backbone network can be viewed as a pool of wavelengths. By employing WDM, the optical backbone's "virtual" topology is configured to provide support for guaranteed bandwidth traffic, packet switching, multicasting and broadcasting.

Guaranteed bandwidth service can be provided by dedicating a wavelength between two OCIs after an arbitration protocol. Datagram service is provided by creating a virtual multihop network [15]. Any scalable multihop topology (like the traditional shufflenet [11] or the bidirectional shufflenet [18]) can be configured on the physical topology.
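The shufflenet virtual topology mentioned above has a simple algebraic structure, sketched below. The indexing convention (k columns of p**k nodes, node (c, r) linking to a perfect shuffle of r's base-p digits in the next column) follows the usual shufflenet definition and is our assumption; the paper only cites the topology.

```python
def shufflenet_neighbors(c, r, p, k):
    """Successor nodes of node (column c, row r) in a (p, k) shufflenet.

    The network has k columns of p**k rows each; node (c, r) connects to
    the p nodes of column (c+1) mod k whose rows are the left base-p
    shift of r with each possible new low digit appended."""
    next_col = (c + 1) % k
    rows = p ** k
    return [(next_col, (r * p + d) % rows) for d in range(p)]

# A (2, 2) shufflenet: 2 columns x 4 rows, 8 nodes, degree 2.
assert shufflenet_neighbors(0, 1, 2, 2) == [(1, 2), (1, 3)]
```

Configuring such a topology on the WDM pool amounts to assigning one wavelength (or T/WDMA slot) per directed virtual link; the bidirectional variant preferred later in the text adds the reverse link of each edge.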

Since the primary goal of the SSN project is to extend the Myrinet low-latency, high-bandwidth protocols to the optical backbone, the latter must be equipped to support wormhole routing and backpressure flow control. The optical backbone can be arbitrary as long as it satisfies these requirements. One consequence of enforcing backpressure is that bidirectional links are required. Thus, the bidirectional shufflenet is preferable to the traditional shufflenet.

3.3 WDM Optical Backbone Protocols

Optical backbone protocols must transparently extend the Myrinet services end-to-end while achieving efficient utilization (and reallocation) of expensive backbone resources, efficient scaling, congestion protection and fault tolerance. The WDM optical star/tree architecture is ideally suited to this set of requirements in that it allows high-bandwidth interconnection, efficient integration of multiple services and flexible reallocation of channel/bandwidth resources. The optical fabric will initially be a combination of space and wavelength division multiplexing; later in the project, it will be enriched to also support T/WDMA (Time and Wavelength Division Multiple Access).

The following are the key protocols supported in the optical backbone:

(a) Guaranteed bandwidth protocol. Initially, a separate wavelength/fiber will be allocated to each connection. When T/WDMA becomes available, multiple connections with possibly different data rates will be carried in each WDM channel. An efficient technique for supporting multirate connections (based on receiver pipelining and slot retuning) was reported in [12].

(b) Datagram transfer protocol. Two basic options are available here: the single-hop scheme (with wavelength retuning at transmitter and receiver on a datagram-by-datagram basis) and the multihop scheme. In our testbed, we will pursue the multihop scheme, which is less demanding in terms of transmitter/receiver tunability. The single-hop scheme will be studied via simulation to provide a basis for comparison.

(c) Flow control. The Myrinet backpressure-type flow control is extended to the optical backbone by using large slack buffers in the OCIs. Furthermore, virtual channels have been defined on each individual link of the multihop network, so that a single backpressured worm does not clog the link.

(d) Routing protocol. Two options are available for routing: the "flat" source routing option, which is an end-to-end extension of the Myrinet routing scheme, and the two-level routing option, where separate routing schemes are used for the Myrinet and the optical backbone. The latter scheme is more scalable and offers better flexibility in backbone routing, at the expense of additional implementation complexity (in the OCI). As part of the latter scheme, deflection routing has also been explored [13].

(e) Multicasting is supported both for guaranteed bandwidth services and datagram transfers. Recall that guaranteed bandwidth connections are single-hop ("broadcast and select"). Thus, the signal can be received by all the OCIs in the multicast group by simply tuning to the transmission wavelength at the proper slot. In the multihop network, multicasting can be achieved by defining a proper multicast tree (embedded in the virtual multihop topology). Dynamic tree maintenance with receiver-initiated join is being considered. Another alternative is a Hamiltonian loop visiting all the OCIs in the multicast group.

(f) Priorities are implemented in order to handle datagram traffic with different QoS. Furthermore, priority is given to transit traffic (over entry traffic), in order to keep the multihop backbone clear of congestion and stop the overload at the entry points.

(g) Deadlock prevention. Wormhole routing is prone to deadlocks. A deadlock would bring the entire network to a halt, and therefore must be prevented. The main options are to perform up/down routing on a spanning tree and to use virtual channels to allow deadlock-free shortest paths. These schemes are discussed and evaluated in more detail in [17].

(h) Dynamic reallocation of wavelengths to different services. The WDM architecture offers the unique opportunity to reallocate bandwidth resources to different services (in our case, guaranteed bandwidth and datagram service) based on user requirements. Furthermore, the multihop virtual topology can be dynamically "tuned" to obtain the best match with the current traffic pattern. Previous studies have shown that topology readjustments can lead to significant throughput and delay improvements. Initially, the dynamic reallocation and topology tuning will be in terms of actual wavelengths. With the introduction of T/WDMA, finer-grain, more efficient reallocation can be achieved. Protocols for dynamic reallocation are currently under development.
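The up/down rule mentioned under deadlock prevention can be sketched as a route-legality check. This is a generic illustration of the technique, not the scheme evaluated in [17]: nodes are ordered by an assumed spanning-tree numbering (smaller id = closer to the root), a hop toward a smaller id is "up", and a legal route never takes an up link after a down link, which breaks every cyclic channel dependency.

```python
def is_updown_legal(path):
    """Check the up/down routing rule on a node-id path.

    path: list of node ids, where smaller id means closer to the
    spanning-tree root. Legal routes are zero or more up hops
    followed by zero or more down hops."""
    gone_down = False
    for a, b in zip(path, path[1:]):
        going_up = b < a
        if going_up and gone_down:
            return False   # up after down: could close a deadlock cycle
        if not going_up:
            gone_down = True
    return True

assert is_updown_legal([5, 2, 0, 3, 7])   # up, up, then down, down
assert not is_updown_legal([5, 7, 2])     # down then up: rejected
```

The price of this guarantee is that some shortest paths are forbidden; the virtual-channel option mentioned in the text recovers shortest paths at the cost of extra per-link buffering.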

3.4 OCI Design

An important contribution of the SSN project is the Optical Channel Interface (OCI) card. The goal of the OCI card is a modular design that focuses on the optical backbone protocol by using off-the-shelf Myrinet and optical components where possible.

3.4.1 Functionality

The OCI "bridges" the two levels of the SSN network architecture. The OCI's basic function is to interface the high-speed Myrinet LAN with the optical backbone network. Thus, the OCI has a Myrinet interface and an optical network interface. Since wormhole routing and backpressure flow control are extended to the optical backbone network, the OCI must provide sufficient buffering to prevent data loss due to buffer overflow. The OCI is responsible for electro-optic and opto-electronic conversions; it provides controls for optical transmitters/receivers; it implements the WDM optical backbone protocols (including arbitration and routing); and it collects performance measurements.

3.4.2 Hardware Description

The OCI card is a three-port device (see Fig. 1). One port connects to Myrinet. Another port connects to the optical backbone. The third port is a standard 9U VME connection for OCI configuration and system-monitoring functions. A SPARC 32-bit CPU card and one or more OCI boards will be packaged in a single VME enclosure to realize an integrated package for interfacing multiple Myrinet ports to the SSN optical backbone. Use of a VME-based system eases integration via an industry-standard backplane. The OCI chassis is a 9U VME card cage used to house multiple OCI boards and a SPARC-based CPU card for monitoring and control.

The OCI consists of the dual-LANai circuitry¹ used as a Myrinet destination, the low-level data-path monitoring logic, the flow control and routing logic, the VME interface, the worm buffers and the WDM fiber-optic transceiver. By using field-programmable gate arrays (FPGAs) for the monitor and control logic of the OCI, evolution of the design is possible. The detailed block diagram of the OCI is shown in Fig. 2.

The dual-LANai circuitry is a two-port board which uses a pair of LANai processors to emulate a Myrinet source/destination while passing worms to/from the OCI board in a Myrinet format. One of the LANai processors of this circuitry provides the standard repertoire of Myrinet data and control bytes used to transmit and receive worms as part of a Myrinet [20]. The back end of this LANai looks like a DMA engine [20]. To simplify the interface to the back end of this LANai, a second LANai is used to generate a Myrinet-like port [20] for transmission of worms to/from an OCI card.

The VME interface circuitry implements the VME protocol to communicate with the SPARC card. This circuitry also provides the data, address and interrupt information pathways from the various OCI logic blocks to a SPARC VME card for configuration of an OCI board and statistic-collection operations.

¹LANai is a Myricom, Inc. processor that is used in the Myrinet host interface cards.

The OCI board also contains worm buffers, which are extensions of the slack buffers in the LANai. These buffers are needed due to the long propagation delay of the fiber-optic links of the optical backbone. The OCI card also provides the capability to handle worm priorities. This functionality is achieved by using random access memory (RAM) to allow for selection of worms based on worm priority. For a simple first-in-first-out buffering scheme, the delay through these buffers is in the neighborhood of one microsecond.
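The priority-based worm selection described above can be sketched as a priority queue. This is a behavioral model only (the class, its API, and the smaller-number-is-higher-priority convention are our assumptions, not the OCI hardware): a heap picks the highest-priority queued worm, with a counter preserving FIFO order within a priority level.

```python
import heapq
import itertools

class WormBuffer:
    """Toy model of RAM-backed, priority-ordered worm buffering."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: FIFO within a priority

    def enqueue(self, priority, worm):
        # Smaller number = higher priority (our convention).
        heapq.heappush(self._heap, (priority, next(self._seq), worm))

    def dequeue(self):
        # Highest-priority worm leaves first; equal priorities leave FIFO.
        return heapq.heappop(self._heap)[2] if self._heap else None

wb = WormBuffer()
wb.enqueue(2, "bulk-A")
wb.enqueue(0, "transit")   # e.g., transit traffic preferred over entry traffic
wb.enqueue(2, "bulk-B")
assert [wb.dequeue(), wb.dequeue(), wb.dequeue()] == ["transit", "bulk-A", "bulk-B"]
```

A plain FIFO corresponds to giving every worm the same priority, which matches the simple scheme whose delay the text estimates at about one microsecond.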

Figure 1: Optical Channel Interface (OCI) chassis.

Figure 2: OCI board block diagram.

Figure 3: Stepped-wavelength laser diode array.

The final component of the OCI is the WDM fiber-optic transceiver. The stepped-wavelength laser diode arrays (see Fig. 3) have four side-by-side single-mode DFB lasers made on a single substrate, with each laser emitting light at a slightly different wavelength in the 1.55 μm region. The laser design is an InP-based ridge waveguide laser. Due to the simplicity of fabrication and less stringent fabrication tolerances compared to buried-heterostructure lasers, ridge waveguide lasers are seen to have strong potential for commercial use.

The laser wafers were prepared by atmospheric-pressure metal-organic chemical vapor deposition (MOCVD) on (100)-oriented n+ InP substrates. The active region consists of four compressively strained (ε = 1%) InGaAsP quantum wells, each 94 Å wide, with 150 Å barriers of InGaAsP. Fabrication of this material into 4-element DFB laser arrays requires e-beam writing of the diffraction gratings, an MOCVD regrowth, and the fabrication of the ridge waveguide structure. The top four layers of the laser structure (contact, two InP layers, and etch stop) are removed in order to define the distributed feedback grating in the SCH region. The pitch of the grating for the individual lasers is determined by the modal index and the design criterion of four wavelengths in the range 1.54–1.56 μm (to be compatible with erbium-doped fiber amplifiers). This leads to four grating pitches in the range 2375–2400 Å.
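The quoted pitch range follows from the first-order DFB Bragg condition, λ = 2·n_eff·Λ. The modal index value below is an assumption (the text does not state it); with n_eff ≈ 3.24 the computed pitches land within roughly 0.5% of the quoted 2375–2400 Å range:

```python
# First-order DFB Bragg condition: lambda_B = 2 * n_eff * pitch.
# n_eff is an assumed modal index for the InP ridge waveguide; the text
# gives only the resulting pitch range (2375-2400 Angstroms).
n_eff = 3.24

for wavelength_nm in (1540, 1560):
    pitch_nm = wavelength_nm / (2 * n_eff)   # Bragg pitch for this channel
    print(f"{wavelength_nm} nm -> grating pitch {10 * pitch_nm:.0f} Angstrom")
```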

This hardware provides the capability to transmit or receive at different optical wavelengths. By providing the flow control and routing logic with the capability to use different wavelengths, multiple simultaneous communications over the same fiber optic medium become possible, improving optical backbone throughput and functionality.

4 Performance Results

In this section, we describe the simulation platform that has been developed for evaluating protocols for SSN. We also present a few selected performance results that are illustrative of the evolutionary methodology prior to design.

4.1 Simulation Platform

A modular simulator of the SSN has been developed to experiment with various routing, flow-control, and interconnection strategies in a hierarchical, reconfigurable network. The detailed model comprises more than 5000 lines of Maisie (a C-based message-passing discrete-event simulation language [2]) code. Innovative features of this simulator, based on the capabilities of Maisie, include:

• byte-level rather than packet-level simulation for accurate modeling of wormhole routing

• dynamic network reconfigurability

• potential for parallel execution to improve performance

• scalable and modular design
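To illustrate the byte-level (rather than packet-level) modeling listed above, here is a minimal discrete-event sketch of wormhole pipelining, written in Python rather than Maisie for brevity. The function and its parameters are hypothetical simplifications, not the SSN simulator's actual interface:

```python
import heapq

def simulate_worm(worm_len, hops, byte_time=1.0, link_delay=5.0):
    """Byte-level sketch: every byte of a worm is an independent event
    forwarded hop by hop, so the head can occupy a downstream link while
    the tail is still entering the first one (wormhole pipelining).
    Times are abstract units; flow control is not modeled here."""
    events = []  # (time, hop reached, byte index)
    for b in range(worm_len):
        heapq.heappush(events, (b * byte_time, 0, b))  # injection events
    finish = 0.0
    while events:
        t, hop, b = heapq.heappop(events)
        if hop == hops:
            finish = max(finish, t)                    # byte fully delivered
        else:
            heapq.heappush(events, (t + link_delay, hop + 1, b))
    return finish

# Pipelined latency: hops * link_delay + (worm_len - 1) * byte_time,
# rather than the store-and-forward hops * (link_delay + worm_len * byte_time).
assert simulate_worm(400, 3) == 3 * 5.0 + 399 * 1.0
```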

The SSN simulator operates mainly through emulation; that is, the operation of the code mirrors the actual operation of the network components being simulated. Although not necessarily the most efficient approach, this method is simple to implement and makes modification of the simulator to reflect protocol enhancements straightforward. Another advantage of this approach is that it makes it easy to port algorithms from simulation to actual implementation.

4.2 Results

The SSN simulator has been used extensively to compare and evaluate protocols for the high-speed electronic network, the optical backbone network, and the entire two-level network. Results from two such studies are included here. The first study compares two alternative approaches to providing multicast service support in Myrinet. The second evaluates deflection routing in the optical backbone network as a congestion-control technique and contrasts it with the backpressure flow-control approach.

4.2.1 Multicasting in Myrinet

As Myrinet does not implement multicasting, we wish to provide multicasting support that efficiently utilizes link capacity and host resources. The basic approach is that the host retransmits received multicast worms to other hosts within the same multicast group. This approach requires no modification of the Myrinet hardware. Two schemes for such retransmissions were investigated:

• Hamiltonian circuit: In this scheme a directed circuit amongst all hosts in a multicast group is defined using a graph representing the network (called the host connectivity graph). To perform multicasting, a source host sends each multicast worm to the next host in the circuit; at this host the worm will pass up through the protocol stack in the normal way and, in addition, will be retransmitted to the next host in the sequence. If the worm is transmitted back to the originator of the multicast message, this serves as an explicit acknowledgement of the successful reception of the multicast message by all hosts of the multicast group. This last transmission could also be omitted.

• Unrooted tree: In this scheme multicasting is performed over an unrooted tree. The tree is formed over the host connectivity graph. Each node of the tree represents a host, and an incoming multicast worm is retransmitted by the host to its neighbors in the tree. There is one spanning tree for each multicast group.
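The retransmission rules of the two schemes can be contrasted in a short sketch. The function names and graph encoding are hypothetical; only the forwarding logic reflects the schemes described above:

```python
def next_hosts_hamiltonian(circuit, me):
    """Hamiltonian circuit: retransmit to the single successor in the
    directed circuit. Here circuit[0] is taken as the originator, and
    worms are not returned to it (as in the implementation described)."""
    i = circuit.index(me)
    succ = circuit[(i + 1) % len(circuit)]
    return [] if succ == circuit[0] else [succ]

def next_hosts_tree(tree_adj, me, came_from):
    """Unrooted tree: retransmit to every tree neighbor except the one
    the worm arrived from, flooding the whole spanning tree exactly once."""
    return [n for n in tree_adj[me] if n != came_from]

circuit = ["A", "B", "C", "D"]
assert next_hosts_hamiltonian(circuit, "B") == ["C"]
assert next_hosts_hamiltonian(circuit, "D") == []   # stop before originator

tree = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}
assert sorted(next_hosts_tree(tree, "B", "A")) == ["C", "D"]
```

The sketch also shows why the tree scheme loads the network less: a tree host fans a worm out to several neighbors at once, so the worst-case retransmission chain is the tree depth rather than the full circuit length.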

A Hamiltonian-circuit multicast scheme such as the one just outlined has been implemented. In order to reduce the load on the host's kernel and peripheral bus, the retransmission of multicast worms is done entirely within the network interface card by using the software that runs on the card's processor. A control process running on the host as a normal application informs the network interface (via its device driver) of the pertinent multicast information, such as the multicast group, the next host in the multicast group, etc. The remainder of the operation is performed by the device driver (for originating hosts only) and the network interface software. This control process, the multicast group manager, is currently a stub process, but it is expected to develop into a more complex process that will interact with multicast group managers on other hosts and with the IP group management protocol. Under this implementation, multicast worms are not returned to their originating host but stop at the previous node in the circuit.

The two alternative schemes have been simulated using the SSN simulator. The average multicast latency was used as a measure of network loading and performance. An 8 × 8 torus was simulated. The average worm length was 400 bytes, and 10% of the worms generated were multicast worms. Ten multicast groups, each with ten members, were simulated, with the members chosen at random. The resultant latencies are shown in Figure 4. These results indicate that the unrooted tree scheme produces less loading of the network than the Hamiltonian-circuit-based scheme, as expected.

Figure 4: Average multicast latency against network load on an 8 × 8 torus.

4.2.2 Congestion Control in the Optical Backbone

Backpressure flow control spreads information about congestion by "freezing" the worm in place, possibly stretching it across several links and slack buffers in switches. In deflection routing, a worm for which the required output port is busy is deflected onto a free output port. This can cause out-of-sequence delivery and unbounded delays. However, deflection routing implicitly informs hosts about congestion in the network: a host can monitor the traffic at the local switch and observe the number of hops that passing worms have undergone as a result of deflection, where a higher number of hops traveled indicates a higher level of congestion. Thus, backpressure flow control and deflection routing are possible alternative solutions to the problem of congestion in the network.

Wormhole routing with backpressure flow control can lead to deadlocks [6, 17]; hence, we use the virtual channel approach or the Up/Down routing approach [17]. Deflection routing could result in livelock, so we use a deflection buffer (DF) to store some of the worm and probabilistically break livelocks, a procedure that we call delayed deflection [13]. With delayed deflection, the routing decision on an incoming worm is delayed until an output port becomes available.
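The delayed-deflection decision just described can be sketched as follows, under the assumption that the switch tracks how much of the incoming worm has been absorbed into the deflection buffer. The interface is hypothetical; only the decision logic reflects the scheme:

```python
def route_worm(preferred_busy, free_ports, bytes_buffered, df_size):
    """Delayed-deflection sketch. The incoming worm is absorbed into the
    deflection buffer while its preferred output port is busy; only if
    the buffer would overflow is the worm deflected onto some free port,
    which breaks livelock at the cost of a possible extra hop. A df_size
    of 0 reduces to pure asynchronous deflection."""
    if not preferred_busy:
        return "forward"                   # shortest-path output is free
    if bytes_buffered < df_size:
        return "wait"                      # keep absorbing into the DF buffer
    if free_ports:
        return ("deflect", free_ports[0])  # misroute rather than block
    return "wait"                          # nothing free: keep waiting

assert route_worm(False, ["p2"], 0, 2000) == "forward"
assert route_worm(True, ["p2"], 500, 2000) == "wait"
assert route_worm(True, ["p2"], 2000, 2000) == ("deflect", "p2")
```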

We have compared deflection and backpressure using simulation. The experimental optical network configuration is the 8 × 8 bidirectional grid with an average worm size of 500 bytes. This means that worms are short with respect to the propagation delay on the links (which is 400 bytes for a 1 km link, with the speed of light in fiber assumed to be 2 × 10^8 m/s). Fig. 5 shows host throughputs for this network configuration. Seven curves are shown, for different flow-control techniques and different slack buffer sizes. One curve (labeled VC) refers to the virtual channel approach, with slack buffers of 2000 bytes (slightly more than four times the link propagation delay) on each virtual channel. Three virtual channels per link are needed in this topology to avoid deadlocks while allowing shortest-path routing. Two other curves (labeled UR) refer again to backpressure flow control, but with unrestricted shortest-path routing, i.e., without virtual channels or Up/Down routing. In this situation deadlocks occur regardless of buffer size. The remaining four curves (labeled DF) refer to deflection routing, with deflection buffer sizes equal to 0 (pure asynchronous deflection), 2000, 6000, and 10000 bytes (delayed deflection), respectively.

Figure 5: Host throughput for an 8 × 8 bidirectional grid. Average worm size is 500 bytes; maximum worm size is 10000 bytes.
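The 400-byte in-flight figure quoted above can be checked directly: the bytes "on the wire" equal the link rate times the propagation delay, and solving for the rate the text implies gives 640 Mb/s, consistent with Myrinet's link rate:

```python
# Sanity check of the in-flight figure: bytes on the wire equal
# link_rate * propagation_delay. Solving for the rate the text implies:
speed_of_light_fiber = 2e8    # m/s, as assumed in the text
link_length = 1e3             # 1 km link
bytes_in_flight = 400         # figure quoted in the text

prop_delay = link_length / speed_of_light_fiber       # seconds
implied_rate_bps = bytes_in_flight * 8 / prop_delay   # bits per second

assert abs(prop_delay - 5e-6) < 1e-12                 # 5 microseconds
assert abs(implied_rate_bps - 640e6) < 1e3            # ~640 Mb/s link rate
```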

The conclusion is that unrestricted shortest-path routing results in deadlocks and pure asynchronous deflection gives poor throughput. However, delayed deflection routing compares quite favorably with the virtual-channel-based, deadlock-free shortest-path routing with backpressure flow control. Similar results were obtained for the bidirectional shufflenet topology. This implies that the SSN could use either approach to control congestion in the optical backbone network; the deciding factors would be the implementation complexity and the tolerance for the drawbacks of deflection routing, namely unbounded delays and out-of-order delivery.

Figure 6: The SSN testbed consists of four clusters of Myrinet switches: four at the UCLA Computer Science department, two at the Electrical Engineering department, two in Pasadena (one at JPL and one at Caltech), and two at the Aerospace Corporation.

5 Testbed

The basic SSN testbed topology is shown in Fig. 6.

The Myrinet switches are placed in four clusters: a group of four in the UCLA Computer Science department building, a group of two in the UCLA Electrical Engineering department building, two at JPL/Caltech (between two supercomputers), and, finally, two at the Aerospace Corporation. OCIs interconnect the clusters, as well as selected ports within the largest cluster at the UCLA Computer Science department.

One fiber optic link segment (14 km) of the CASA gigabit network between JPL and Caltech in the Pasadena area is proposed as the target SSN testbed demonstration site using scalable-I/O supercomputers (see Fig. 7). The proposed SSN application is the UCLA Global Climate Model (GCM) being developed by R. Mechoso for the CASA project. On the present CASA network, a single HIPPI channel permits only a coarse-grain coupling of the ocean/atmosphere model between the Caltech Intel DELTA (running the ocean model) and the JPL Cray YMP (running the atmospheric model). Running over the existing dark fiber, SSN would provide four times the capacity (3.2 Gbit/s) and lower-latency routing between the two supercomputers than the present single HIPPI channel with crossbar interfaces. This would provide a foundation for a finer-grain decomposition of the GCM application. Simultaneously, high-performance workstations can interactively capture image results of the running GCM model and peruse new data sets that would be staged for later GCM runs. The SSN network dynamically allocates/deallocates optical channel bandwidth as workstations or massively parallel processor (MPP) nodes enter or leave the network. The Myrinet network node also accommodates instantaneous reconfiguration of the MPP I/O channels from asynchronous I/O for separate partitioned jobs (e.g., one per quadrant of the MPP) to coherently striped I/O for one large single job.

Figure 7: SSN supports a fine-grain distributed supercomputing and visualization GCM application using Myrinet on one segment of the CASA testbed between JPL and Caltech.

Between UCLA and Pasadena (a distance of about 30 miles), and between UCLA and Aerospace (a distance of about 15 miles), the network fabric will consist of point-to-point ATM/SONET OC-3 channels provided by the Pacific Bell CalREN (California Research and Education Network) and GTE consortiums (see Fig. 6). At each location, an OCI can be configured to provide a gateway function by incorporating a SONET/ATM network interface with the OCI SPARC CPU controller. A higher-performance solution is also being explored with a separate Myrinet/ATM gateway. Initially, permanent virtual circuits (PVCs) will be used between the sites. Striped-channel performance, which SSN provides via WDM in a LAN/campus setting, can be provided by setting up multiple PVCs and/or SONET OC-3 channels.

6 Conclusion

As fine-grain, closely coupled, real-time distributed system applications begin to mature for cluster workstation computing and the networking of meta-massively parallel processor supercomputers, low-latency, rapidly reconfigurable networks with high Gb/s-per-channel capacity will be required. SSN provides one such network fabric for binding these systems together that is easily scalable in both physical size and number of ports per host. It is also adaptable to a variety of optical transmission techniques, providing multiple growth paths as WDM and spatial optical multiplexing optoelectronics become commercially available. Such networks also raise a host of new issues in network management, flow and congestion control, and error recovery that will be the subject of future work.

Acknowledgment

The research described in this paper was carried out at the Jet Propulsion Laboratory, California Institute of Technology; the University of California, Los Angeles; and The Aerospace Corporation. It was sponsored by the Advanced Research Projects Agency (ARPA) of the U.S. Department of Defense under Contract DABT63-93-C-0055, and by the University of California through an agreement with the National Aeronautics and Space Administration.

Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not constitute or imply its endorsement by the United States Government, the University of California, the Jet Propulsion Laboratory, California Institute of Technology, or The Aerospace Corporation.

References

[1] S. Alexander et al. A Precompetitive Consortium on Wide-Band All-Optical Networks. Journal of Lightwave Technology, 11(5/6):714–735, May/June 1993.

[2] R. Bagrodia, K. M. Chandy, and J. Misra. A Message-Based Approach to Discrete-Event Simulation. IEEE Transactions on Software Engineering, 13(6), June 1987.

[3] J. A. Bannister, M. Gerla, and M. Kovačević. Routing in Optical Networks. In Martha Steenstrup, editor, Routing in Communications Networks, pages 187–225. Prentice Hall, Englewood Cliffs, New Jersey, 1995.

[4] D. Cohen, G. Finn, R. Felderman, and A. DeSchon. ATOMIC: A Very-High-Speed Local Area Network. In IEEE Workshop on High-Speed Communication Systems, Tucson, Arizona, February 1992.

[5] D. Cohen, G. Finn, R. Felderman, and A. DeSchon. The ATOMIC LAN. In IEEE Workshop on High Performance Communication Subsystems, Tucson, Arizona, February 1992.

[6] W. J. Dally and C. L. Seitz. Deadlock-Free Message Routing in Multiprocessor Interconnection Networks. IEEE Transactions on Computers, C-36(5):547–553, May 1987.

[7] N. R. Dono, P. E. Green, K. Liu, R. Ramaswami, and F. F. Tong. A Wavelength Division Multiple Access Network for Computer Communication. IEEE Journal on Selected Areas in Communications, 8:983–994, August 1990.

[8] P. W. Dowd, K. Bogineni, K. A. Aly, and J. Perreault. Hierarchical Scalable Photonic Architectures for High-Performance Processor Interconnection. IEEE Transactions on Computers, 42(9):1105–1120, September 1993.

[9] M. S. Goodman, H. Kobrinski, M. Vecchi, R. M. Bulley, and J. L. Gimlett. The LAMBDANET Multiwavelength Network: Architecture, Applications, and Demonstrations. IEEE Journal on Selected Areas in Communications, 8:995–1004, August 1990.

[10] B. Kannan, S. Fotedar, and M. Gerla. A Two-Level Optical Star WDM Metropolitan Area Network. In Proceedings of GLOBECOM '94, pages 563–566, November 1994.

[11] M. Karol and S. Shaikh. A Simple Adaptive Routing Scheme for Shufflenet Multihop Lightwave Networks. In Proceedings of GLOBECOM '88, pages 1640–1647, 1988.

[12] M. Kovačević, M. Gerla, and J. Bannister. Time and Wavelength Division Multiple Access with Acoustooptic Tunable Filters. Fiber and Integrated Optics, 12(2):113–132, 1993.

[13] E. Leonardi, F. Neri, P. Palnati, and M. Gerla. Congestion Control Techniques in Asynchronous Wormhole Routing Networks. Technical Report CSD-950065, UCLA, 1995.

[14] B. Mukherjee. WDM-Based Local Lightwave Networks Part I: Single-Hop Systems. IEEE Network, 6(3):12–27, May 1992.

[15] B. Mukherjee. WDM-Based Local Lightwave Networks Part II: Multi-Hop Systems. IEEE Network, 6(4):20–32, July 1992.

[16] L. M. Ni and P. K. McKinley. A Survey of Wormhole Routing Techniques in Direct Networks. IEEE Computer, 26(2):62–76, February 1993.

[17] P. Palnati, M. Gerla, and E. Leonardi. Deadlock-Free Routing in an Optical Interconnect for High-Speed Wormhole Routing Networks. Technical Report CSD-950064, UCLA, 1995.

[18] P. Palnati, E. Leonardi, and M. Gerla. Bidirectional Shufflenet: A Multihop Topology for Backpressure Flow Control. In Proceedings of the 4th International Conference on Computer Communications and Networks, pages 74–81, September 1995.

[19] M. D. Schroeder, A. D. Birrell, M. Burrows, H. Murray, R. M. Needham, T. L. Rodeheffer, E. H. Satterthwaite, and C. P. Thacker. Autonet: A High-Speed, Self-Configuring Local Area Network Using Point-to-Point Links. IEEE Journal on Selected Areas in Communications, 9(8):1318–1335, October 1991.

[20] C. Seitz. Private Communication.

[21] C. Seitz, D. Cohen, and R. Felderman. Myrinet: A Gigabit-per-Second Local-Area Network. IEEE Micro, 15(1):29–36, February 1995.

[22] C. Seitz, J. Seizovic, and W. Su. The Design of the Caltech Mosaic C Multicomputer. In Proceedings of the Symposium on Integrated Systems, March 1993.
