TRANSCRIPT
Honeywell Project Update
David Bueno
3-29-2005
Visualization Capabilities
- Adam has created a simulation output post-processor that lets us visualize simulation outputs using Upshot
- Shows when nodes are communicating, sending, receiving, etc.
- Very useful for script debugging or for understanding simulation results
Current Work
- Completed most simulation runs for the "sensitivity analysis"
- Examining simulation results
- Writing journal paper
  - Done with GMTI, ~halfway through SAR results
- Completed reviewing the fault tolerance literature summaries of the other group members (each person is doing this) and working on ideas to apply the knowledge gained to RIO systems
"Sensitivity Analysis"
- Altering parameters to possible relevant values based on feedback from Honeywell
- Clock rate/link width
  - 125 MHz or 250 MHz; other values are most likely unreasonable
  - GMTI needs more GM ports if the clock rate is lowered; SAR does not, so we try it both ways
- Number of switch ports and system size
  - 6- vs. 8-port switches
  - Number of processing/GM nodes
- Type of flow control (tx- vs. rx-controlled)
  - Now exposing some differences where we previously found none
- Comparable system with 6-port switches and 3-node boards
- Size of switch memory
- GMTI chunk-based global memory approach
- Comparison of GMTI and SAR using similar partitioning and system sizes
- Increase the range of chunk sizes examined for GMTI and SAR
- Study the GMTI algorithm with smaller system sizes, and slightly larger cube sizes for the biggest system
  - Relates to fault tolerance: we might have cold spare boards hooked up to the backplane and not have access to boards in every backplane slot (i.e., a 7-board system might really be a 5-board system)
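The sweep above can be sketched as a small configuration generator. The parameter names and values come from the bullets; the structure, function names, and the interpretation of the extra-GM-port rule are our own illustration, not the actual simulation scripts.

```python
from itertools import product

# Values from the sensitivity-analysis slide; everything else is illustrative.
CLOCK_RATES_MHZ = [125, 250]      # other values considered unreasonable
SWITCH_PORTS = [6, 8]             # 6- vs. 8-port switches
FLOW_CONTROL = ["tx", "rx"]       # tx- vs. rx-controlled

def configurations(app):
    """Enumerate simulation configurations for one application (GMTI or SAR).

    Interpreted here as: GMTI at the lowered 125 MHz clock is tried both
    with and without extra GM ports; SAR never needs the extra ports.
    """
    for clock, ports, fc in product(CLOCK_RATES_MHZ, SWITCH_PORTS, FLOW_CONTROL):
        gm_options = [False]
        if app == "GMTI" and clock == 125:
            gm_options = [False, True]  # try it both ways
        for extra_gm in gm_options:
            yield {"app": app, "clock_mhz": clock, "switch_ports": ports,
                   "flow_control": fc, "extra_gm_ports": extra_gm}

# SAR yields 2 x 2 x 2 = 8 configurations; GMTI's 125 MHz cases double,
# giving 12 configurations in total.
```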
6-Port Switch System
- 15 processors
- Each processor board has three processors and one 6-port RIO switch (leaving three backplane connections)
- 3-port global memory, 3-switch backplane
System/Switch Size Variation
- Both of these systems have "unused" boards
- Each system dedicates half its total possible boards to normal operation (plus 1 GM board)
[Chart: 6-Port Switch (9-Node) vs. 8-Port Switch (16-Node) Systems. Completion latency (ns, ~150M-400M) vs. system-wide chunk size (1MB-16MB) for three series: 16-Node Synch Baseline, 9-Node Synch, 9-Node 2x Buffered.]
System/Switch Size Variation (2)
- 16-node vs. 15-node systems
- 15-node systems created using 6-port switches
- VERY uneven levels of fault tolerance
  - The 15-node system uses all resources possible to achieve these results; the 16-node system has 3 unused boards
[Chart: CPI Completion Latency. Completion latency (ns, ~150M-240M) vs. chunk size per processor (64KB-1MB) for four series: Synched Reads 16 Node, 2x Buffered 16 Node, Synched Reads 15 Node, 2x Buffered 15 Node.]
Fault Tolerance Literature Search Status
- Reviewing each other's summaries and drawing conclusions about the best direction for the project
- Picking the best papers to keep as references and discarding those that weren't as helpful
- Analyzing and categorizing the methods studied for network fault tolerance
  - Lots of material out there for various MINs with small switching elements; we have to figure out how to apply it to more modern networks
- Determining the most effective way to measure reliability/fault tolerance/cost/etc. from the literature review
- Ongoing search for additional literature
- Created a preliminary concept diagram and description
- Chris is working on a review of key FT concepts from various books to share with the group
Categories of Fault Tolerant Literature
- MINs ([6], [12], [13], [14], [15])
  - Lots of resources, but much of it is outdated and highly specific to certain network topologies built from large numbers of small (i.e., 2-port) switching elements
  - Clos network literature applies to our network topology
- Embedded or real-time networks (especially avionics/space) ([2], [3], [4], [5], [9])
- RapidIO-specific ([7], [8], [16])
- Modern SAN/LAN literature ([1], [11])
  - InfiniBand, Extreme Networks switches, etc.
  - Dynamic load balancing, dynamic spanning trees, etc.
- Fault tolerance metrics and methodologies (Ian) ([9])
- Upper-level fault tolerance ([10])
Key FT Literature (1)
1. J.M. Montanana, J. Flich, A. Robles, P. Lopez, and J. Duato, "A Transition-Based Fault-Tolerant Routing Methodology for InfiniBand Networks," in Proceedings of the 18th International Parallel and Distributed Processing Symposium, Santa Fe, New Mexico, April 2004.
InfiniBand fault tolerance extensions. Also discusses static vs. dynamic fault tolerance and has some good definitions and references. (SAN/LAN)
2. R. Hotchkiss, B.C. O'Neill, and S. Clark, "Fault Tolerance for an Embedded Wormhole Switched Network," in International Conference on Parallel Computing in Electrical Engineering, Quebec, Canada, August 2000.
Embedded network using adaptive routing for fault tolerance with “group adaptive routing” (similar to the Honeywell RIO switch “trunking”). (Embedded networks- non-space systems)
3. H. Olnowich, D. Kirk, "ALLNODE-RT: A Real Time, Fault Tolerant Network," in Proceedings of the Second Workshop on Parallel and Distributed Real-Time Systems, Cancun, Mexico, 1994.
Switched network with alternate paths for fault tolerance. Discusses “hot” and “cold” paths and lets the network choose which path traffic should take. (Embedded networks- real-time systems)
4. S. Chau, J. Smith, A. Tai, "A Design-Diversity Based Fault-Tolerant COTS Avionics Bus Network," 2001 Pacific Rim International Symposium on Dependable Computing, Seoul, South Korea, December 2001.
SpaceWire/IEEE 1394 dual-bus embedded COTS space system for fault tolerance. (Embedded networks- space systems)
5. Paul Walker, "Fault-Tolerant FPGA-Based Switch Fabric for SpaceWire: Minimal loss of parts and throughput per chip lost," Proc. of 4th International Conference on Military and Aerospace Programmable Logic Devices (MAPLD), Laurel, MD, September 11-13, 2001.
Discussion of the fault tolerance of SpaceWire. Has a good discussion of why to use switched networks rather than bus-based ones, and a VERY good discussion of switch/backplane design in terms of bisection bandwidth and other factors. Page 5 even gives stats that appear to almost directly apply to our backplane design. (Embedded networks- space systems)
6. Nian-Feng Tzeng, Pen-Chung Yew and Chuan-Qi Zhu, "Realizing Fault-Tolerant Interconnection Networks via Chaining," IEEE Transactions on Computers, Vol. 37, No. 4, April 1988.
This paper focuses on a discussion of fault tolerance in MINs through “chaining.” It has a potentially useful discussion on how to calculate relative cost for their scheme in which they calculate “cost” using number of crosspoints in the switching element and number of switching elements, and calculate “cost effectiveness” by taking the ratio of the MTBF to the cost. (MINs)
7. Victor Menasce, "RapidIO as the Foundation for Fault Tolerant Systems," RapidIO Trade Association Application Note, March 2004.
This is an excellent application note describing how RapidIO may be used to build fault tolerant systems. It lists the six key elements of fault tolerance and provides examples of how RapidIO can provide each. See the extended summary for details on each of these elements. (RapidIO)
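The cost model summarized for [6] is simple enough to state directly. A minimal sketch, assuming "cost" is the crosspoint count times the number of switching elements and "cost effectiveness" is MTBF over that cost, as described above; the example numbers are illustrative, not from the paper.

```python
def cost(crosspoints_per_switch, num_switches):
    """'Cost' per [6]: total crosspoints across all switching elements."""
    return crosspoints_per_switch * num_switches

def cost_effectiveness(mtbf_hours, crosspoints_per_switch, num_switches):
    """'Cost effectiveness' per [6]: ratio of MTBF to cost."""
    return mtbf_hours / cost(crosspoints_per_switch, num_switches)

# Illustrative numbers only: a 6x6 crossbar has 36 crosspoints, and we
# assume a hypothetical 50,000-hour MTBF for an 8-switch system.
baseline = cost_effectiveness(mtbf_hours=50000,
                              crosspoints_per_switch=36,
                              num_switches=8)
```

This makes it easy to compare two candidate topologies on the same footing: the one with the higher MTBF-to-crosspoint ratio wins under this metric.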
Key FT Literature (2)
8. "RapidIO Interconnect Specification, Part VIII: Error Management Extensions Specification," RapidIO Trade Association, September 2002.
This RapidIO specification describes a set of extensions for extended error management capabilities. It defines registers used for counting faults as well as for specifying the actions to be taken once the number of faults reaches a certain threshold. Most of the specification deals with the physical layer, although some logical layer functionality is discussed. In addition, there is a small but most likely applicable discussion on hot insertion/removal of devices in RapidIO systems that could apply to fault tolerance. (RapidIO)
9. P. Irey IV, B. Chappell, R. Hott, D. Marlow, K. O'Donoghue and T. Plunkett, "Metrics, Methodologies and Tools for Analyzing Network Fault Recovery Performance in Real-Time Distributed Systems," Proc. International Parallel and Distributed Processing Symposium (IPDPS), Cancun, Mexico, May 1-5, 2000.
This paper focuses on fault tolerance recovery performance of a Navy ship network for a real-time distributed system. The metrics used are: inter-send time, inter-arrival time, one-way latency, percent data received, and number of duplicates. Their methodology seems to provide a nice starting point for creating one of our own. (Metrics, Real-time networks)
10. Scott Atchley, Stephen Soltesz, James Plank, Micah Beck and Terry Moore, "Fault-Tolerance in the Network Storage Stack," IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, Ft. Lauderdale, FL, April 2002.
This paper discusses a fault-tolerant network storage stack similar to the OSI model but not directly related. This paper may be a good starting point for upper-level fault-tolerance in the RIO network stack (above the logical layer). The importance of high-level congestion control is mentioned, and while it is not present in our models it is accomplished through carefully scheduled network usage and synchronization. In the final NMP system (RIO or otherwise) it may be necessary to have high-level mechanisms such as those in this paper, which definitely makes this paper a relevant addition to our literature collection. (Upper-level fault tolerance)
11. Extreme Networks, "Leveraging Redundancy to Build Fault-Tolerant Networks," White Paper, 2002.
This is a whitepaper from Extreme Networks describing how to use their Ethernet switches to build fault tolerant systems. It discusses link aggregation, spanning trees, routing, etc. (Modern SAN/LAN literature)
12. A. Youssef and I.D. Scherson, "Randomized Routing on Benes-Clos Networks," in The New Frontiers, A Workshop on Future Directions of Massively Parallel Processing, McLean, Virginia, October 1992.
This paper describes randomized routing algorithms for circuit-switched crossbar-based Benes-Clos networks. Two of the three algorithms are fault-tolerant, using the "stuck-at" fault model. One of the three algorithms may be used for fault tolerance in cases where faults cannot even be diagnosed (though performance is of course degraded). The paper is an excellent starting point for us, but the network is much different from the networks we are studying. Topology is the key similarity, and randomized routing may certainly be useful in our study of fault tolerance. (MINs)
13. N. Das and J. Dattagupta, "A Fault Location Technique and Alternate Routing in Benes Network," in Proceedings of the Fourth Asian Test Symposium, Bangalore, India, November 1995.
This paper is not extremely relevant in terms of fault tolerance concepts for our system, but it does have several useful definitions (esp. critical and noncritical fault set) and uses and defines the “switch fault” model. It also mentions the stuck-at fault model and link fault model. (MINs)
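The metrics from [9] above (inter-send time, inter-arrival time, one-way latency, percent data received, duplicates) are straightforward to compute from per-packet timestamps. A minimal sketch; the function name, argument names, and timestamp layout are our own assumptions, not from the paper.

```python
def fault_recovery_metrics(send_times, arrival_times, expected, duplicates=0):
    """Compute the metrics used in [9] from per-packet timestamps (seconds).

    send_times / arrival_times are parallel lists for packets that arrived;
    'expected' is the total number of packets that were sent.
    """
    inter_send = [b - a for a, b in zip(send_times, send_times[1:])]
    inter_arrival = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    one_way = [r - s for s, r in zip(send_times, arrival_times)]
    pct_received = 100.0 * len(arrival_times) / expected
    return {"inter_send": inter_send, "inter_arrival": inter_arrival,
            "one_way_latency": one_way, "percent_received": pct_received,
            "duplicates": duplicates}

# Toy trace: 3 of 4 expected packets arrived -> 75% data received.
m = fault_recovery_metrics([0.0, 0.1, 0.2], [0.05, 0.16, 0.24], expected=4)
```

During a fault-recovery experiment, a gap in the inter-arrival times directly exposes the recovery interval, which is why [9] leans on these particular metrics.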
Key FT Literature (3)
14. M. Bhatia and A. Youssef, "Performance Analysis and Fault Tolerance of Randomized Routing on Clos Networks," in Sixth Symposium on the Frontiers of Massively Parallel Processing, Annapolis, MD, October 1996.
This paper extends the work performed in [12] for circuit-switched crossbar-based Clos networks using the stuck-at fault model. It allows multiple faults to occur in multiple columns of the Clos network and allows the faults to be worked around without diagnosis. For the method without diagnosis, performance is good for faults in the first or second column, but not as good for faults in the third column or for the case where there are faults in multiple columns. (MINs)
15. Y. Yang and J. Wang, "A Fault-Tolerant Rearrangeable Permutation Network," IEEE Transactions on Computers, Vol. 53, Issue 4, April 2004.
This recent work analytically determines the number of "losing-contact" faults that may be allowed at any of the three stages of a Clos network while still having the rearrangeable condition hold. It also gives routing algorithms; it is full of equations, with no simulative or experimental results. However, it has many good references and is one of the only recent works on fault tolerance in Clos networks. (MINs)
16. S. Fuller, RapidIO – The Next Generation Communication Fabric for Embedded Application, John Wiley & Sons, Inc., January 2005.
This book is an excellent reference for evaluating, understanding, and developing with the RapidIO interconnect. It covers the essentials of the specification, the history of RapidIO, usage questions, and many other topics. (RapidIO)
Clos Networks
- Used for all of our successful designs thus far for GMTI and SAR
- Optimal routing is very complex, with or without fault tolerance
- Our scheme uses a static routing table
  - Sufficient for many applications, but not all
  - Some form of dynamic load balancing could also help with fault tolerance
- Most Clos network FT literature ([12], [14], [15]) deals with crossbar switches with "stuck-at" type faults where certain inputs can't route to certain outputs
  - Also usually for circuit-switched networks
  - Probably does not directly apply to a central-memory RIO switch
  - Does not apply to total switch failure
- Our approach will likely be a combination of alternate-path routing and redundant hardware

[Figure: Example Clos Network [14]]
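The core idea behind randomized routing on a 3-stage Clos network, in the spirit of [12], can be sketched in a few lines: any input-output pair can be connected through any middle-stage switch, so a connection simply picks a random healthy one. This assumes faults are already diagnosed; the undiagnosed-fault variant of [12] is not modeled here, and the function name is our own.

```python
import random

def pick_middle_stage(middle_switches, faulty, rng=random):
    """Choose a middle-stage switch for a new connection at random,
    skipping known-faulty switches. If every middle-stage switch is in
    the fault set, the fault set is critical: no path remains."""
    healthy = [m for m in middle_switches if m not in faulty]
    if not healthy:
        raise RuntimeError("critical fault set: no middle-stage path remains")
    return rng.choice(healthy)

# With middle switch 1 faulty, traffic spreads over the remaining switches.
choice = pick_middle_stage(middle_switches=[0, 1, 2, 3], faulty={1})
assert choice != 1
```

The appeal for our study is that randomization both balances load and works around faults with no per-fault rerouting logic, at the price of occasionally suboptimal paths.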
Fault-Tolerant RIO System Concept Diagram
Concept Diagram Assumptions
- The RapidIO network is capable of detecting known faulty switches or links and routing around them dynamically
- Many cases of a faulty board switch or node will require replacing the board with one of the spares, or operating that board with reduced capabilities
- A backplane switch fault can be routed around with reduced network performance
- No need for an entire redundant "network" if the backplane is built for FT and redundant cards are available
Definitions
- "Critical fault set": a fault set F under which every possible path between a source and destination passes through a switch in F (i.e., the full-access property is lost) [13]. (A "noncritical" fault set is any other fault set.)
- Static connectivity: a direct path exists in the Clos network from every source to every destination (graph G is fully connected). [14]
- Dynamic full accessibility: every node can access every other node, but may have to go through another node to accomplish it (graph G is strongly connected). [14]
- Rearrangeable network: a switching network that can realize all possible permutations between its inputs and outputs, where rearrangement of existing connections in the network is permitted when realizing a new connection. [15]
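The graph-theoretic side of these definitions is easy to check mechanically. A minimal sketch, assuming the network is represented as a directed adjacency dict (our own representation, not from [14]): dynamic full accessibility is just strong connectivity, tested here with BFS reachability from every node.

```python
from collections import deque

def reachable(adj, start):
    """BFS: set of nodes reachable from 'start' in a directed adjacency dict."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def dynamically_fully_accessible(adj):
    """Dynamic full accessibility per [14]: graph is strongly connected,
    i.e. every node reaches every other, possibly via intermediate nodes."""
    nodes = set(adj)
    return all(reachable(adj, n) == nodes for n in nodes)

# Toy example: a directed ring has no direct edge between opposite nodes,
# yet every node can still reach every other through intermediates.
ring = {0: [1], 1: [2], 2: [3], 3: [0]}
```

This distinction matters for fault analysis: a fault can destroy static connectivity (direct paths) while dynamic full accessibility, and hence correctness at reduced performance, survives.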
Fault Models
- Stuck-at fault [12]
  - Crossbar elements are stuck, causing certain input ports to only be able to route to a subset of switch output ports
- Link fault
- Switch fault [13]
  - Switch is totally unusable; outputs disconnected from inputs
- Losing-contact fault [15]
  - Crossbar element cannot make contact (essentially "stuck-at" open)
- We will likely develop our own fault model based on feedback from Honeywell
  - Failed port on the central memory switch? Failed memory? Switch fault?
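The crossbar-oriented fault models named above can be sketched against a simple connectivity matrix. This representation (a boolean input-by-output matrix) is our own illustration, not how [12], [13], or [15] formalize their models.

```python
def crossbar(n_in, n_out):
    """Fault-free crossbar: every input can reach every output."""
    return [[True] * n_out for _ in range(n_in)]

def stuck_at(matrix, inp, allowed_outputs):
    """Stuck-at fault [12]: input 'inp' can only route to a subset of outputs."""
    for out in range(len(matrix[inp])):
        matrix[inp][out] = out in allowed_outputs

def losing_contact(matrix, inp, out):
    """Losing-contact fault [15]: one crosspoint cannot make contact."""
    matrix[inp][out] = False

def switch_fault(matrix):
    """Switch fault [13]: outputs completely disconnected from inputs."""
    for row in matrix:
        for out in range(len(row)):
            row[out] = False
```

A fault model for a central-memory RIO switch would need different primitives (e.g., a failed port or failed memory region), which is exactly the open question on this slide.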
FT Literature- Important Points to Consider
- If a failure occurs in a switching element but it keeps running, erroneous routing and sequencing could occur, which would be hard to detect and recover from
- Building systems out of a large number of switches, as in [5], is nice in the event you lose one, but it also makes it more likely that a failure will be encountered at some point with such a large number of components
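The second point is just the standard independence argument, worth making concrete: if each switch fails with probability p over some interval, the chance that at least one of N switches fails is 1 - (1 - p)^N. The numbers below are illustrative, not measured values.

```python
def prob_any_failure(p_switch_fail, num_switches):
    """Probability that at least one of N independent switches fails:
    1 - (1 - p)^N. More switches make losing any single one matter less,
    but make a failure somewhere in the system likelier."""
    return 1.0 - (1.0 - p_switch_fail) ** num_switches

# With a hypothetical 1% per-switch failure probability over an interval:
# 4 switches -> ~3.9% chance of some failure; 32 switches -> ~27.5%.
```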
FT Ideas for Honeywell
- Randomized routing
- "Trunking": can we peek at the buffer space for each output port first?
- Combination of methods, e.g., if a board switch goes down we might want to replace the board, but if a backplane switch goes down, work around it
  - Need several spare cards and at least 1 extra GM board
- Create/select a "baseline" system with minimal or zero fault tolerance
  - Then add switches, links, etc., and study the effects through analytical and simulative means
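The buffer-peeking idea for trunking can be sketched as a selection rule. This is purely hypothetical: whether the Honeywell RIO switch actually exposes per-port buffer occupancy is the open question raised in the bullet above, and the function and parameter names are our own.

```python
def pick_trunk_port(trunk_ports, free_buffer_space):
    """Hypothetical buffer-aware trunking: among the ports in a trunk
    group, send on the one whose output buffer has the most free space
    (a simple least-loaded policy)."""
    return max(trunk_ports, key=lambda p: free_buffer_space[p])

# Ports 2 and 3 form a trunk group; port 3 has more free buffer space.
port = pick_trunk_port([2, 3], {2: 1024, 3: 4096})
```

If buffer occupancy is not visible, a fallback is round-robin across the trunk group, which balances load on average without any switch-state feedback.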
Questions for Honeywell
- Static vs. dynamic fault tolerance?
  - Both, but primarily dynamic; details from Cliff posted on the project webpage