fault detection and diagnosis. outline fault management functionality event correlations concept...
TRANSCRIPT
![Page 1: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/1.jpg)
Fault Detection and Diagnosis
![Page 2: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/2.jpg)
OutlineFault management functionalityEvent correlations conceptTechniques
![Page 3: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/3.jpg)
Definitions A fault may cause hundreds of alarms. We need to be able to do the following:
Detect the existence of faults Locate faults
An alarm External manifestations of faults
Generated by components Observable, e.g. via messages
An alarm represents a symptom of a fault. An event
An occurrence of interest, e.g. an alarm message
![Page 4: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/4.jpg)
Fault Management FunctionalitiesFault detection
Should be real-time Techniques can be based on active schemes (e.g., polling)
or event-based schemes (where a system component says that it has detected a failure).
Fault location Is it a link or system component or application component?
Determine corrective actionsCarry out corrective actions and determine
effectiveness
![Page 5: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/5.jpg)
Alarm (Event) CorrelationAlarm explosion
A single problem might trigger multiple symptoms (e.g., router is down)
There could be too many alarms for an administrator to handle; Techniques used to help: Compression: reduction of multiple occurrences of an alarm into
a single alarm Count: replacement of a number of occurrences of alarms with a
new alarm Suppression: inhibiting a low-priority alarm in the presence of a
higher priority alarm Boolean: substitution of a set of alarms satisfying a condition
with a new alarm
![Page 6: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/6.jpg)
Fault Diagnosis
Major application of alarm correlation is fault diagnosis
Useful in fault location
![Page 7: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/7.jpg)
Faults and Alarms
f0
f1 f2
A1 A2 A3 A4 A5C1 C2
C3
![Page 8: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/8.jpg)
Faults and AlarmsThe previous figure shows that correlation c1
detects the fault f1 and that correlation c2 detects the fault f2.
Correlating c1 and c2 into the correlation c0 allows the diagnosis of the fault f0.
![Page 9: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/9.jpg)
ExampleLet a1, a2, a3, a4, a5 be alarms generated by
client processes indicating that a client process is not getting a response from a server.
Correlation techniques can be used to show that since a1, a2, a3 were generated by client processes by trying to contact the same server then the server may be the problem. Similar comments apply to a4 and a5.
![Page 10: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/10.jpg)
ExampleFrom the perspective of client processes, the
servers (at the second level of the previous figure) are at fault.
However, it may be observed that alarms were generated by these two servers. Both alarms indicate that each of the two servers are not getting a response and that both were trying to contact the same server. This is another correlation.
![Page 11: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/11.jpg)
Rule-Based ReasoningBased on expert systemsIntended to represent heuristic knowledge as
rules.Components
Knowledge Base (KB): Contains the expert rules that describe the action to be taken when a specific condition occurs e.g., if-then-else
Working Memory(WM): Stores information such as the system/network topology and data collected through the monitoring of application and network components.
![Page 12: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/12.jpg)
Rule-Based ReasoningComponents (continued)
Inference engine: matches the current state (as represented by the monitored data) of the system against the left-side of a rule in the knowledge base in order to trigger the action.
The rules are meant to encapsulate expert knowledge
Why rule-based reasoning? Rules are interpreted which means that rules can be
changed without recompiling. Since expert knowledge can be wrong and/or complete, this
feature is very useful.
![Page 13: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/13.jpg)
ApproachesModel-basedFault propagationModel traversingCase-based reasoning
![Page 14: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/14.jpg)
Model-Based ReasoningUse structural knowledge, e.g., via object-
oriented modelsTo illustrate model-based reasoning, we will
look at network equipment. A network equipment class is used to describe
a network equipment type.Network equipment classes are organized into
a hierarchy using class/subclass relations.
![Page 15: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/15.jpg)
Model-Based Reasoning The root class is a generic class that represents the
most general information common to all network equipment e.g., vendor name
The next level of the hierarchy describes the basic network equipment classes such as physical-trunk class and switch class.
Example: The trunk class refers to T1 trunk class and T3 trunk class.
An association class represents a relationship between two object classes that represents connectivity.
![Page 16: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/16.jpg)
Model-Based ReasoningThe network configuration model is
constructed from the instances of individual network equipment.
The network configuration model is a graph consisting of objects instantiated from the network equipment classes. If there is an edge between two objects then this implies that the two objects are connected.
![Page 17: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/17.jpg)
Example
BGS
MC
WSC
Csd_syslab
Csd_res
Stats
GeneticsGeophys zoo
router
hub
fiber hub
10Base-T
Fiber optic
![Page 18: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/18.jpg)
ExampleThere is a Network Equipment class that has
attributes used to describe a network equipment. At a minimum assume that there is an attribute representing the name of the vendor.
You can subclass the following from the Network Equipment class: Connector Cable Segment
![Page 19: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/19.jpg)
Example The Connector class has two subclasses:
Hub Router
The Cable Segment class also has two subclasses: 10Base-T Fiber Optic
The Hub class has two subclasses: Fiber Hub Local Hub
![Page 20: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/20.jpg)
Example
Network Equipment
Connector CableSegment
Hub Router 10Base-T Fiber Optic
Fiber Hub Local Hub
![Page 21: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/21.jpg)
Model-Based ReasoningMessage Class Hierarchy
Describes messages created by elements of a network configuration model.
These messages report symptoms. They are essentially alarms.
![Page 22: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/22.jpg)
ExampleA hub can generate a message stating that one
of its ports is “bad”. In fact, there is a message associated with each port.
A hub can generate a message stating that the cable segment on a particular port does not seem to be functioning. This can be encapsulated in a message class called Hub_Report_Port_Bad. This can be subclassed into messages generated by each type of hub.
![Page 23: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/23.jpg)
Model-Based ReasoningCorrelations
Conditions under which the correlations are asserted Correlations are stated in terms of message classes. At run-time there is match between an instantiation and a
message class. Example: If a local hub is connected to more than one
router and there are two messages from the routers indicating that there are problems connecting to the local hub then this is a good indication that the hub is a problem.
![Page 24: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/24.jpg)
Fault PropagationBased on models that describe which
symptoms will be observed if a specific fault occurs.
Monitors typically collect managed data at network elements and detect out of tolerance conditions, generating appropriate alarms.
An event model is used by a management application to analyze these alarms.
The event model represents knowledge of events and their causal relationships.
![Page 25: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/25.jpg)
Fault Propogation (Coding Approach) Correlation is concerned with analysis of causal relations
among events. The notation ef is to denote causality of the event f by
the event e. Causality is a partial order between events. The relation may be described by a causality graph whose
directed edges represent causality. Distinguish between fault events (faults or problems) and
symptom events (symptoms). Nodes of a causality graph may be marked as problems (P)
or symptoms (S). Some symptoms are not directly caused by faults, but rather
by other symptoms.
![Page 26: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/26.jpg)
Fault Propagation (Coding Approach)
76
1
8
9
11
3
5
4
10
2
Example Causality Graph
![Page 27: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/27.jpg)
Fault Propagation (Coding Approach)
The correlation problem A correlation p s means that problem p can cause a
chain of events leading to the symptom s. This can be represented by a graph.
![Page 28: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/28.jpg)
Fault Propagation (Coding Approach)
1
9
11
10
2
6
A Correlation Graph
![Page 29: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/29.jpg)
Fault Propagation (Coding Approach) For each fault (problem) p, the correlation graphs
provides a vector that summarizes information available about correlation and symptoms and problems.
This is referred to as the code of the problem. Alarms may also be described using a vector
assigning measures of 1 and 0 to observed and unobserved symptoms.
The alarm correlation problem is that of finding problems whose codes optimally match an observed alarm vector.
![Page 30: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/30.jpg)
Fault Propagation (Coding Approach)Example codes (look at correlation graph
example) 1 = (0,1,1) – This indicates that problem 1 causes
symptoms 9 and 10 2 = (1,0,1) – This indicates that problem 2 causes
symptoms 6 and 10 11 = (0,1,1) – This indicates that problem 11 causes
symptoms 9 and 10.
![Page 31: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/31.jpg)
Fault Propagation (Coding Approach)Example alarm vector
Assume that alarms indicating symptoms 3 and 9 have been observed.
a = (0,1,1)
We can infer that either 1 or 11 match the observation a.
These two problems have identical codes and hence are indistinguishable.
The fault management application may have to do additional tests.
![Page 32: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/32.jpg)
Fault Propagation (Coding Approach)A Codebook is an array of the vectors just
defined. The number of symptoms associated with a
single problem may be very large. Sometimes a much smaller set of symptoms is selected to
accomplish a desired level of distinction among problems.
![Page 33: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/33.jpg)
Fault Propagation (Coding Approach)
Example Codebook p1 p2 p3 p4 p5 p6
1 1 0 0 1 0 1
2 1 1 1 1 0 0
4 1 0 1 0 1 0
![Page 34: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/34.jpg)
Fault Propagation (Coding Approach)
Example Codebook p1 p2 p3 p4 p5 p6
1 1 0 0 1 0 1
3 1 1 0 1 0 0
4 1 0 1 0 1 0
6 1 1 1 0 0 1
9 0 1 0 0 1 1
18 0 1 1 1 0 0
![Page 35: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/35.jpg)
Fault Propagation (Coding Approach)
Distinction among problems is measured by the Hamming Distance between their codes
The radius of a codebook is one half of the minimal Hamming distance among codes.
When the radius is 0.5, the code provides distinction between problems.
![Page 36: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/36.jpg)
Fault Propagation (Coding Approach)Is this easy to apply to application processes?
No
Why Applications are dynamic The coding approach assumes the system is fairly static.
![Page 37: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/37.jpg)
Model Traversing Reconstruct fault propagation at run time using
relationships between objects Begins with managed object that generated event Work best when object relationship is graph-like and
easy to obtain since it must be obtained at run-time Performance Potential parallelism
Weaknesses Lack of flexibility Not well-structured like fault propagation
![Page 38: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/38.jpg)
Model TraversingCharacteristics
Event-Driven: Fault management application is passive until an event arrives. This event is the reporting of a symptom.
Correlation : Decides whether two events result from the same primary fault.
Relationship Exploration: The fault management application correlates events by detecting special relationships between the source objects of those events.
![Page 39: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/39.jpg)
Model TraversingEvent reports should have the following
information: Symptom type Source Target etc
If symptom si’s target is the same as sj’s source then this is an indication that si is a secondary symptom. This allows us to ignore certain alarms.
![Page 40: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/40.jpg)
Model Traversing For each event, construct a graph of objects (models)
related to the source object of that event. When two such graphs touch each other, i.e. contain
at least one common object, the events which initiated their construction are regarded to be correlated. Possibly these two events are the result of the same fault.
If si is correlated with sj and sj is correlated with sk
then through transitivity we can conclude that si is a
secondary symptom.
![Page 41: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/41.jpg)
Model TraversingThe process of eliminating symptom reports
may result in reports that have the same target.Example:
s1 and t s2 and t
It might be necessary to construct possible paths of objects between s1 and t as well as s2 and t
Nodes in common are good candidates for the faults.
![Page 42: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/42.jpg)
Model Traversing We will now discuss the building of graphs The algorithm for building graphs uses relationships
between network hardware and software components to search for the root cause of a problem.
Assumes that information about the relationships between the components are available (e.g., through a database).
Assumes that there are functions including these: getNextHop(source, target,B): Get the node representing the next entity
(that comes after B) in the path between source and target. Note that this may return more than one entity.
![Page 43: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/43.jpg)
Model Traversing
Example Assume the following configuration of processes and
machines. All machines are connected through the Ethernet. P1 is on chocolate; P2 is on peppermint P3 is on vanilla; P4 is on strawberry P5 is on doublefudge; P6 is on mintchip
Communication is through remote procedure calls. This basically requires that all communication go through a daemon process on the server host’s machine. We will call this rpcd
![Page 44: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/44.jpg)
Model Traversing
P4
P3
P5
P1
P2
P6
Call structure is depicted in the following graph:
![Page 45: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/45.jpg)
Model Traversing
Example Assume that P4 terminates abnormally causing a
cascade of timeouts Correlation will result on focusing on these event
reports: (P1,P4) (P3,P4)
Not enough to diagnose the fault. It’s all at the process level. There are still many entities or objects to examine since you do not
want everything generating a message.
![Page 46: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/46.jpg)
Model TraversingExampleStarting with P1 the next component (node)
along the path of the connection between P1 and P4 is identified.
Between P1 and P4 are many entities. We will start out with a vertical search which basically results in the fact that P1 is running on a host machine called chocolate
![Page 47: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/47.jpg)
Model Traversingchocolate is connected to the hub through an
ethernet cable. The hub is connected to strawberry through an
ethernet connection cable where P2 is running. Thus we can say that the path is the following: P1, chocolate,ethernet connection
cable,hub,strawberry,ethernet connection cable, rpcd.strawberry,P4
The path between P3 and P4 is the following: P3, vanilla, ethernet connection cable, hub, ethernet
connection cable, strawberry, rpcd.strawberry, P4
![Page 48: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/48.jpg)
Model Traversing
Example This suggests that we can narrow down the problem to
hub, ethernet connection cable, strawberry.rpcd, strawberry, P4.
At this point, the fault management application may want to poll for additional information. The polling may check to see if something is up or not. An example is applying the ping operation to the host machine called strawberry.
What if every entity is up? This may indicate that strawberry is overloaded. An indication of an overload can be found by measuring the CPU load.
![Page 49: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/49.jpg)
Model TraversingBuilding the graphs requires structural
information and the use of rules.
![Page 50: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/50.jpg)
Model TraversingImplementationWhat management services are needed?
To detect and report symptoms, one could use application instrumentation.
The instrumentation library should most likely talk with a management process (or agent).
The agent sends an event report to the event server. The event server may have a set of rules for symptom
correlation. After correlation, a task may be invoked that does
relationship exploration and the final diagnosis.
![Page 51: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/51.jpg)
Model TraversingImplementationInformation Needed
Information representing the relationships between hardware components and software components is needed.
This needs to be stored in a database or a directory service (e.g., X500)
An API needs to be defined to retrieve this information. Rules can be used to help construct the graph.
![Page 52: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/52.jpg)
Model TraversingImplementationInformation Needed
How is the information collected? Many different techniques. Examples include:
Processes (using instrumentation) may have to register and have their information put into the database.
Network information may have to be entered manually.
![Page 53: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/53.jpg)
Model TraversingSummaryPerforms very quickly once model is built
Model can be constructed incrementally during normal processing; do not have to wait until failure
Can operate in parallelCan accommodate multiple events; different
starting points can result in same problem element
Does require model reflective of run-time One that changes too fast is a problem
![Page 54: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/54.jpg)
Case-based Reasoning (CBR)Objective
Learn from experience Solutions to novel problems Avoid extensive maintenance
Basic idea: recall, adapt and execute episodes of former problem-solving in an attempt to deal with a current problem
![Page 55: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/55.jpg)
Case-based Reasoning
In p u t R e triev e A d ap t P ro cess
C aseL ib ra ry
Approach
![Page 56: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/56.jpg)
Case-Based ReasoningStrategyUseful for domains in which a body of
knowledge with a case structure exists or is easily obtainable
Case structure: Set of fields or “slots” Capture “essential” information
Yield discriminators Set of fields highly correlated with problems or solutions
Need to find “closest” match
![Page 57: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/57.jpg)
Case-based Reasoning
In p u t R e triev e A d ap t P ro cess
C aseL ib ra ry
D isc rim in a to rs
A d ap ta tio nTech n iq u es
U se r-b asedA d ap ta tio n
Adapt
![Page 58: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/58.jpg)
Case-based ReasoningSummaryNeeds well-defined casesLikely to work well when problems are
“close” to existing solutionsProblem selecting solutions when “not so
close” Dangerous in following actions? How to adapt?
![Page 59: Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques](https://reader030.vdocuments.net/reader030/viewer/2022033104/5697bf7a1a28abf838c82bc8/html5/thumbnails/59.jpg)
Summary
Variety of approachesMostly applied in network management
scenarios More controlled? Better understanding of problems?
Limited experience in application management