TRANSCRIPT
Honeywell Project Update
David Bueno
3-29-2005
Visualization Capabilities
- Adam has created a simulation output post-processor that lets us visualize simulation outputs using Upshot
- Shows when nodes are communicating, sending, receiving, etc.
- Very useful for script debugging or for understanding simulation results
Current Work
- Completed most simulation runs for the "sensitivity analysis"
- Examining simulation results
- Writing journal paper
  - Done with GMTI, ~halfway through SAR results
- Completed reviewing the fault tolerance literature summaries of the other group members (each person is doing this) and working on ideas to apply the knowledge gained to RIO systems
"Sensitivity Analysis"
- Altering parameters to possible relevant values based on feedback from Honeywell
- Clock rate/link width
  - 125 MHz or 250 MHz; other values are most likely unreasonable
  - GMTI needs more GM ports if the clock rate is lowered; SAR does not, so we try it both ways
- Number of switch ports and system size
  - 6- vs. 8-port switches
  - Number of processing/GM nodes
- Type of flow control (tx- vs. rx-controlled)
  - Now exposing some differences where we previously found none
- Comparable system with 6-port switches and 3-node boards
- Size of switch memory
- GMTI chunk-based global memory approach
- Comparison of GMTI and SAR using similar partitioning and system sizes
- Increase the range of chunk sizes examined for GMTI and SAR
- Study the GMTI algorithm with smaller system sizes, and slightly larger cube sizes for the biggest system
  - Relates to fault tolerance: we might have cold spare boards hooked up to the backplane and not have access to boards in every backplane slot (i.e., a 7-board system might really be a 5-board system)
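The sweep above can be sketched as a small configuration generator. The parameter names and values come from the bullets; the structure, function names, and the interpretation of the extra-GM-port rule are our own illustration, not the actual simulation scripts.

```python
from itertools import product

# Values from the sensitivity-analysis slide; everything else is illustrative.
CLOCK_RATES_MHZ = [125, 250]      # other values considered unreasonable
SWITCH_PORTS = [6, 8]             # 6- vs. 8-port switches
FLOW_CONTROL = ["tx", "rx"]       # tx- vs. rx-controlled

def configurations(app):
    """Enumerate simulation configurations for one application (GMTI or SAR).

    Interpreted here as: GMTI at the lowered 125 MHz clock is tried both
    with and without extra GM ports; SAR never needs the extra ports.
    """
    for clock, ports, fc in product(CLOCK_RATES_MHZ, SWITCH_PORTS, FLOW_CONTROL):
        gm_options = [False]
        if app == "GMTI" and clock == 125:
            gm_options = [False, True]  # try it both ways
        for extra_gm in gm_options:
            yield {"app": app, "clock_mhz": clock, "switch_ports": ports,
                   "flow_control": fc, "extra_gm_ports": extra_gm}

# SAR yields 2 x 2 x 2 = 8 configurations; GMTI's 125 MHz cases double,
# giving 12 configurations in total.
```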
6-Port Switch System
- 15 processors
- Each processor board has three processors and one 6-port RIO switch (leaving three backplane connections)
- 3-port global memory, 3-switch backplane
System/Switch Size Variation
- Both of these systems have "unused" boards
- Each system dedicates half its total possible boards to normal operation (plus 1 GM board)
[Chart: 6-Port Switch (9-Node) vs. 8-Port Switch (16-Node) Systems. Completion latency (ns, ~150M-400M) vs. system-wide chunk size (1MB-16MB) for three series: 16-Node Synch Baseline, 9-Node Synch, 9-Node 2x Buffered.]
System/Switch Size Variation (2)
- 16-node vs. 15-node systems
- 15-node systems created using 6-port switches
- VERY uneven levels of fault tolerance
  - The 15-node system uses all resources possible to achieve these results; the 16-node system has 3 unused boards
[Chart: CPI Completion Latency. Completion latency (ns, ~150M-240M) vs. chunk size per processor (64KB-1MB) for four series: Synched Reads 16 Node, 2x Buffered 16 Node, Synched Reads 15 Node, 2x Buffered 15 Node.]
Fault Tolerance Literature Search Status
- Reviewing each other's summaries and drawing conclusions about the best direction for the project
- Picking the best papers to keep as references and discarding those that weren't as helpful
- Analyzing and categorizing the methods studied for network fault tolerance
  - Lots of material out there for various MINs with small switching elements; we have to figure out how to apply it to more modern networks
- Determining the most effective way to measure reliability/fault tolerance/cost/etc. from the literature review
- Ongoing search for additional literature
- Created a preliminary concept diagram and description
- Chris is working on a review of key FT concepts from various books to share with the group
Categories of Fault Tolerant Literature
- MINs ([6], [12], [13], [14], [15])
  - Lots of resources, but much of it is outdated and highly specific to certain network topologies built from large numbers of small (i.e., 2-port) switching elements
  - Clos network literature applies to our network topology
- Embedded or real-time networks (especially avionics/space) ([2], [3], [4], [5], [9])
- RapidIO-specific ([7], [8], [16])
- Modern SAN/LAN literature ([1], [11])
  - InfiniBand, Extreme Networks switches, etc.
  - Dynamic load balancing, dynamic spanning trees, etc.
- Fault tolerance metrics and methodologies (Ian) ([9])
- Upper-level fault tolerance ([10])
Key FT Literature (1)
1. J.M. Montanana, J. Flich, A. Robles, P. Lopez, and J. Duato, "A Transition-Based Fault-Tolerant Routing Methodology for InfiniBand Networks," in Proceedings of the 18th International Parallel and Distributed Processing Symposium, Santa Fe, New Mexico, April 2004.
InfiniBand fault tolerance extensions. Also discusses static vs. dynamic fault tolerance and has some good definitions and references. (SAN/LAN)
2. R. Hotchkiss, B.C. O'Neill, and S. Clark, "Fault Tolerance for an Embedded Wormhole Switched Network," in International Conference on Parallel Computing in Electrical Engineering, Quebec, Canada, August 2000.
Embedded network using adaptive routing for fault tolerance with “group adaptive routing” (similar to the Honeywell RIO switch “trunking”). (Embedded networks- non-space systems)
3. H. Olnowich, D. Kirk, "ALLNODE-RT: A Real Time, Fault Tolerant Network," in Proceedings of the Second Workshop on Parallel and Distributed Real-Time Systems, Cancun, Mexico, 1994.
Switched network with alternate paths for fault tolerance. Discusses “hot” and “cold” paths and lets the network choose which path traffic should take. (Embedded networks- real-time systems)
4. S. Chau, J. Smith, A. Tai, "A Design-Diversity Based Fault-Tolerant COTS Avionics Bus Network," 2001 Pacific Rim International Symposium on Dependable Computing, Seoul, South Korea, December 2001.
SpaceWire/IEEE 1394 dual-bus embedded COTS space system for fault tolerance. (Embedded networks- space systems)
5. Paul Walker, "Fault-Tolerant FPGA-Based Switch Fabric for SpaceWire: Minimal loss of parts and throughput per chip lost," Proc. of 4th International Conference on Military and Aerospace Programmable Logic Devices (MAPLD), Laurel, MD, September 11-13, 2001.
Discussion of the fault tolerance of SpaceWire. Has a good discussion of why to use switched networks rather than bus-based ones, and a VERY good discussion of switch/backplane design in terms of bisection bandwidth and other factors. Page 5 even gives stats that appear to almost directly apply to our backplane design. (Embedded networks- space systems)
6. Nian-Feng Tzeng, Pen-Chung Yew and Chuan-Qi Zhu, "Realizing Fault-Tolerant Interconnection Networks via Chaining," IEEE Transactions on Computers, Vol. 37, No. 4, April 1988.
This paper focuses on a discussion of fault tolerance in MINs through “chaining.” It has a potentially useful discussion on how to calculate relative cost for their scheme in which they calculate “cost” using number of crosspoints in the switching element and number of switching elements, and calculate “cost effectiveness” by taking the ratio of the MTBF to the cost. (MINs)
7. Victor Menasce, "RapidIO as the Foundation for Fault Tolerant Systems," RapidIO Trade Association Application Note, March 2004.
This is an excellent application note describing how RapidIO may be used to build fault tolerant systems. It lists the six key elements of fault tolerance and provides examples of how RapidIO can provide each. See the extended summary for details on each of these elements. (RapidIO)
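The cost model summarized for [6] is simple enough to state directly. A minimal sketch, assuming "cost" is the crosspoint count times the number of switching elements and "cost effectiveness" is MTBF over that cost, as described above; the example numbers are illustrative, not from the paper.

```python
def cost(crosspoints_per_switch, num_switches):
    """'Cost' per [6]: total crosspoints across all switching elements."""
    return crosspoints_per_switch * num_switches

def cost_effectiveness(mtbf_hours, crosspoints_per_switch, num_switches):
    """'Cost effectiveness' per [6]: ratio of MTBF to cost."""
    return mtbf_hours / cost(crosspoints_per_switch, num_switches)

# Illustrative numbers only: a 6x6 crossbar has 36 crosspoints, and we
# assume a hypothetical 50,000-hour MTBF for an 8-switch system.
baseline = cost_effectiveness(mtbf_hours=50000,
                              crosspoints_per_switch=36,
                              num_switches=8)
```

This makes it easy to compare two candidate topologies on the same footing: the one with the higher MTBF-to-crosspoint ratio wins under this metric.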
Key FT Literature (2)
8. "RapidIO Interconnect Specification, Part VIII: Error Management Extensions Specification," RapidIO Trade Association, September 2002.
This RapidIO specification describes a set of extensions for extended error management capabilities. It defines registers used for counting faults as well as for specifying the actions to be taken once the number of faults reaches a certain threshold. Most of the specification deals with the physical layer, although some logical layer functionality is discussed. In addition, there is a small but most likely applicable discussion on hot insertion/removal of devices in RapidIO systems that could apply to fault tolerance. (RapidIO)
9. P. Irey IV, B. Chappell, R. Hott, D. Marlow, K. O'Donoghue and T. Plunkett, "Metrics, Methodologies and Tools for Analyzing Network Fault Recovery Performance in Real-Time Distributed Systems," Proc. International Parallel and Distributed Processing Symposium (IPDPS), Cancun, Mexico, May 1-5, 2000.
This paper focuses on fault tolerance recovery performance of a Navy ship network for a real-time distributed system. The metrics used are: inter-send time, inter-arrival time, one-way latency, percent data received, and number of duplicates. Their methodology seems to provide a nice starting point for creating one of our own. (Metrics, Real-time networks)
10. Scott Atchley, Stephen Soltesz, James Plank, Micah Beck and Terry Moore, "Fault-Tolerance in the Network Storage Stack," IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, Ft. Lauderdale, FL, April 2002.
This paper discusses a fault-tolerant network storage stack similar to the OSI model but not directly related. This paper may be a good starting point for upper-level fault-tolerance in the RIO network stack (above the logical layer). The importance of high-level congestion control is mentioned, and while it is not present in our models it is accomplished through carefully scheduled network usage and synchronization. In the final NMP system (RIO or otherwise) it may be necessary to have high-level mechanisms such as those in this paper, which definitely makes this paper a relevant addition to our literature collection. (Upper-level fault tolerance)
11. Extreme Networks, "Leveraging Redundancy to Build Fault-Tolerant Networks," White Paper, 2002.
This is a whitepaper from Extreme Networks describing how to use their Ethernet switches to build fault tolerant systems. It discusses link aggregation, spanning trees, routing, etc. (Modern SAN/LAN literature)
12. A. Youssef and I.D. Scherson, "Randomized Routing on Benes-Clos Networks," in The New Frontiers, A Workshop on Future Directions of Massively Parallel Processing, McLean, Virginia, October 1992.
This paper describes randomized routing algorithms for circuit-switched crossbar-based Benes-Clos networks. Two of the three algorithms are fault-tolerant, using the "stuck-at" fault model. One of the three algorithms may be used for fault tolerance in cases where faults cannot even be diagnosed (though performance is of course degraded). The paper is an excellent starting point for us, but the network is much different from the networks we are studying. Topology is the key similarity, and randomized routing may certainly be useful in our study of fault tolerance. (MINs)
13. N. Das and J. Dattagupta, "A Fault Location Technique and Alternate Routing in Benes Network," in Proceedings of the Fourth Asian Test Symposium, Bangalore, India, November 1995.
This paper is not extremely relevant in terms of fault tolerance concepts for our system, but it does have several useful definitions (esp. critical and noncritical fault set) and uses and defines the “switch fault” model. It also mentions the stuck-at fault model and link fault model. (MINs)
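The metrics from [9] above (inter-send time, inter-arrival time, one-way latency, percent data received, duplicates) are straightforward to compute from per-packet timestamps. A minimal sketch; the function name, argument names, and timestamp layout are our own assumptions, not from the paper.

```python
def fault_recovery_metrics(send_times, arrival_times, expected, duplicates=0):
    """Compute the metrics used in [9] from per-packet timestamps (seconds).

    send_times / arrival_times are parallel lists for packets that arrived;
    'expected' is the total number of packets that were sent.
    """
    inter_send = [b - a for a, b in zip(send_times, send_times[1:])]
    inter_arrival = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    one_way = [r - s for s, r in zip(send_times, arrival_times)]
    pct_received = 100.0 * len(arrival_times) / expected
    return {"inter_send": inter_send, "inter_arrival": inter_arrival,
            "one_way_latency": one_way, "percent_received": pct_received,
            "duplicates": duplicates}

# Toy trace: 3 of 4 expected packets arrived -> 75% data received.
m = fault_recovery_metrics([0.0, 0.1, 0.2], [0.05, 0.16, 0.24], expected=4)
```

During a fault-recovery experiment, a gap in the inter-arrival times directly exposes the recovery interval, which is why [9] leans on these particular metrics.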
Key FT Literature (3)
14. M. Bhatia and A. Youssef, "Performance Analysis and Fault Tolerance of Randomized Routing on Clos Networks," in Sixth Symposium on the Frontiers of Massively Parallel Processing, Annapolis, MD, October 1996.
This paper extends the work performed in [12] for circuit-switched crossbar-based Clos networks using the stuck-at fault model. It allows multiple faults to occur in multiple columns of the Clos network and allows the faults to be worked around without diagnosis. For the method without diagnosis, performance is good for faults in the first or second column, but not as good for faults in the third column or for the case where there are faults in multiple columns. (MINs)
15. Y. Yang and J. Wang, "A Fault-Tolerant Rearrangeable Permutation Network," IEEE Transactions on Computers, Vol. 53, Issue 4, April 2004.
This recent work analytically determines the number of "losing-contact" faults that may be allowed at any of the three stages of a Clos network while still having the rearrangeable condition hold. It also gives routing algorithms; it is full of equations, with no simulative or experimental results. However, it has many good references and is one of the only recent works on fault tolerance in Clos networks. (MINs)
16. S. Fuller, RapidIO – The Next Generation Communication Fabric for Embedded Application, John Wiley & Sons, Inc., January 2005.
This book is an excellent reference for evaluating, understanding, and developing with the RapidIO interconnect. It covers the essentials of the specification, the history of RapidIO, usage questions, and many other topics. (RapidIO)
Clos Networks
- Used for all of our successful designs thus far for GMTI and SAR
- Optimal routing is very complex, with or without fault tolerance
- Our scheme uses a static routing table
  - Sufficient for many applications, but not all
  - Some form of dynamic load balancing could also help with fault tolerance
- Most Clos network FT literature ([12], [14], [15]) deals with crossbar switches with "stuck-at" type faults where certain inputs can't route to certain outputs
  - Also usually for circuit-switched networks
  - Probably does not directly apply to a central-memory RIO switch
  - Does not apply to total switch failure
- Our approach will likely be a combination of alternate-path routing and redundant hardware

[Figure: Example Clos Network [14]]
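The core idea behind randomized routing on a 3-stage Clos network, in the spirit of [12], can be sketched in a few lines: any input-output pair can be connected through any middle-stage switch, so a connection simply picks a random healthy one. This assumes faults are already diagnosed; the undiagnosed-fault variant of [12] is not modeled here, and the function name is our own.

```python
import random

def pick_middle_stage(middle_switches, faulty, rng=random):
    """Choose a middle-stage switch for a new connection at random,
    skipping known-faulty switches. If every middle-stage switch is in
    the fault set, the fault set is critical: no path remains."""
    healthy = [m for m in middle_switches if m not in faulty]
    if not healthy:
        raise RuntimeError("critical fault set: no middle-stage path remains")
    return rng.choice(healthy)

# With middle switch 1 faulty, traffic spreads over the remaining switches.
choice = pick_middle_stage(middle_switches=[0, 1, 2, 3], faulty={1})
assert choice != 1
```

The appeal for our study is that randomization both balances load and works around faults with no per-fault rerouting logic, at the price of occasionally suboptimal paths.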
Fault-Tolerant RIO System Concept Diagram
Concept Diagram Assumptions
- The RapidIO network is capable of detecting known faulty switches or links and routing around them dynamically
- Many cases of a faulty board switch or node will require replacing the board with one of the spares, or operating that board with reduced capabilities
- A backplane switch fault can be routed around with reduced network performance
- No need for an entire redundant "network" if the backplane is built for FT and redundant cards are available
Definitions
- "Critical fault set": a fault set F under which every possible path between a source and destination passes through a switch in F (i.e., the full-access property is lost) [13]. (A "noncritical" fault set is any other fault set.)
- Static connectivity: a direct path exists in the Clos network from every source to every destination (graph G is fully connected). [14]
- Dynamic full accessibility: every node can access every other node, but may have to go through another node to accomplish it (graph G is strongly connected). [14]
- Rearrangeable network: a switching network that can realize all possible permutations between its inputs and outputs, where rearrangement of existing connections in the network is permitted when realizing a new connection. [15]
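The graph-theoretic side of these definitions is easy to check mechanically. A minimal sketch, assuming the network is represented as a directed adjacency dict (our own representation, not from [14]): dynamic full accessibility is just strong connectivity, tested here with BFS reachability from every node.

```python
from collections import deque

def reachable(adj, start):
    """BFS: set of nodes reachable from 'start' in a directed adjacency dict."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def dynamically_fully_accessible(adj):
    """Dynamic full accessibility per [14]: graph is strongly connected,
    i.e. every node reaches every other, possibly via intermediate nodes."""
    nodes = set(adj)
    return all(reachable(adj, n) == nodes for n in nodes)

# Toy example: a directed ring has no direct edge between opposite nodes,
# yet every node can still reach every other through intermediates.
ring = {0: [1], 1: [2], 2: [3], 3: [0]}
```

This distinction matters for fault analysis: a fault can destroy static connectivity (direct paths) while dynamic full accessibility, and hence correctness at reduced performance, survives.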
Fault Models
- Stuck-at fault [12]
  - Crossbar elements are stuck, causing certain input ports to only be able to route to a subset of switch output ports
- Link fault
- Switch fault [13]
  - Switch is totally unusable; outputs disconnected from inputs
- Losing-contact fault [15]
  - Crossbar element cannot make contact (essentially "stuck-at" open)
- We will likely develop our own fault model based on feedback from Honeywell
  - Failed port on the central memory switch? Failed memory? Switch fault?
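The crossbar-oriented fault models named above can be sketched against a simple connectivity matrix. This representation (a boolean input-by-output matrix) is our own illustration, not how [12], [13], or [15] formalize their models.

```python
def crossbar(n_in, n_out):
    """Fault-free crossbar: every input can reach every output."""
    return [[True] * n_out for _ in range(n_in)]

def stuck_at(matrix, inp, allowed_outputs):
    """Stuck-at fault [12]: input 'inp' can only route to a subset of outputs."""
    for out in range(len(matrix[inp])):
        matrix[inp][out] = out in allowed_outputs

def losing_contact(matrix, inp, out):
    """Losing-contact fault [15]: one crosspoint cannot make contact."""
    matrix[inp][out] = False

def switch_fault(matrix):
    """Switch fault [13]: outputs completely disconnected from inputs."""
    for row in matrix:
        for out in range(len(row)):
            row[out] = False
```

A fault model for a central-memory RIO switch would need different primitives (e.g., a failed port or failed memory region), which is exactly the open question on this slide.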
FT Literature- Important Points to Consider
- If a failure occurs in a switching element but it keeps running, erroneous routing and sequencing could occur, which would be hard to detect and recover from
- Building systems out of a large number of switches, as in [5], is nice in the event you lose one, but it also makes it more likely that a failure will be encountered at some point with such a large number of components
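The second point is just the standard independence argument, worth making concrete: if each switch fails with probability p over some interval, the chance that at least one of N switches fails is 1 - (1 - p)^N. The numbers below are illustrative, not measured values.

```python
def prob_any_failure(p_switch_fail, num_switches):
    """Probability that at least one of N independent switches fails:
    1 - (1 - p)^N. More switches make losing any single one matter less,
    but make a failure somewhere in the system likelier."""
    return 1.0 - (1.0 - p_switch_fail) ** num_switches

# With a hypothetical 1% per-switch failure probability over an interval:
# 4 switches -> ~3.9% chance of some failure; 32 switches -> ~27.5%.
```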
FT Ideas for Honeywell
- Randomized routing
- "Trunking": can we peek at the buffer space for each output port first?
- Combination of methods, e.g., if a board switch goes down we might want to replace the board, but if a backplane switch goes down, work around it
  - Need several spare cards and at least 1 extra GM board
- Create/select a "baseline" system with minimal or zero fault tolerance
  - Then add switches, links, etc., and study the effects through analytical and simulative means
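The buffer-peeking idea for trunking can be sketched as a selection rule. This is purely hypothetical: whether the Honeywell RIO switch actually exposes per-port buffer occupancy is the open question raised in the bullet above, and the function and parameter names are our own.

```python
def pick_trunk_port(trunk_ports, free_buffer_space):
    """Hypothetical buffer-aware trunking: among the ports in a trunk
    group, send on the one whose output buffer has the most free space
    (a simple least-loaded policy)."""
    return max(trunk_ports, key=lambda p: free_buffer_space[p])

# Ports 2 and 3 form a trunk group; port 3 has more free buffer space.
port = pick_trunk_port([2, 3], {2: 1024, 3: 4096})
```

If buffer occupancy is not visible, a fallback is round-robin across the trunk group, which balances load on average without any switch-state feedback.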
Questions for Honeywell
- Static vs. dynamic fault tolerance?
  - Both, but primarily dynamic; details from Cliff posted on the project webpage